GTB-tbsequencing / mutation-catalogue-2023

MIT License

Request for Missing Files: Replicating Bioinformatic Pipeline for Tuberculosis Drug Resistance Analysis #12

Open muguading opened 1 month ago

muguading commented 1 month ago

Hello, I'm glad to see that this repository provides the annotated results table for the second version of the tuberculosis drug resistance catalogue, which is highly relevant to our current work. Our research group would also like to run drug resistance identification on our own samples following the protocol provided here. However, while replicating the "Bioinformatic pipeline - Step Functions" process, we noticed that some files were not provided, such as the reference folder and certain scripts under the script folder. Could you provide a compressed archive containing these files for download? Thank you for taking the time to consider our request; we greatly appreciate your efforts and look forward to your response.

sachalau commented 1 month ago

Thank you for your report.

We did not provide reference sequences because we use a separate Step Functions process to fetch them automatically via the NCBI API. I'll add the definition of that step function as well, but essentially everything we did is based on https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000195955.2/
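For anyone who wants to fetch the reference before that step function definition lands, a rough equivalent can be sketched against the public NCBI Datasets REST API (this is an illustrative stand-in, not the actual step function; the helper names are made up here):

```python
"""Sketch: fetch the H37Rv reference (GCF_000195955.2) via the NCBI Datasets API.

Illustrative stand-in for the step function described above, not the
actual pipeline code.
"""
import urllib.request

DATASETS_API = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha"

def build_download_url(accession, annotation_types=("GENOME_FASTA", "GENOME_GFF")):
    """Build the zip-package download URL for a genome accession."""
    types = ",".join(annotation_types)
    return (f"{DATASETS_API}/genome/accession/{accession}/download"
            f"?include_annotation_type={types}")

def fetch_reference(accession, dest):
    """Download the zip package (genome FASTA + GFF3 annotation) to `dest`."""
    urllib.request.urlretrieve(build_download_url(accession), dest)

if __name__ == "__main__":
    print(build_download_url("GCF_000195955.2"))
```

The downloaded zip contains the genome FASTA and GFF3 under `ncbi_dataset/data/GCF_000195955.2/`, which would fill the role of the missing reference folder.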

It's correct that many of the scripts we used during processing were not included in the repository. However, since I'll have to go through them and clean them up a bit for readability, could you please let me know which ones you would be most interested in?

muguading commented 1 month ago

I noticed that the JSON workflow references scripts under the script directory, such as format_all_genotypes.sh, calculate_quality_metrics_for_sample.py, format_sequencing_stats.py, format_taxonomy_assignment.py, and so on. Additionally, when mosdepth is run, a BED file is passed to the --by parameter. These files are missing from the workflow you provided, which makes it difficult to reproduce the process and, consequently, to obtain consistent calls at the drug resistance loci. Could you please provide some guidance on how to obtain these files or where to find them? Thank you for your assistance!

sachalau commented 1 month ago

Dear @muguading

I'm sorry for the delay. I have added the scripts you mentioned. The BED file is generated by a script, gff_to_bed.py, which I have also added.
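For readers following along before those files were pushed, the conversion that a GFF-to-BED script performs can be approximated as below. This is a minimal sketch, not the repository's gff_to_bed.py: it assumes a standard 9-column GFF3, keeps only `gene` features, and pulls the name from the `Name=` or `locus_tag=` attribute; the real script may filter differently.

```python
"""Sketch: convert GFF3 feature lines to BED intervals for mosdepth --by.

Assumptions (not taken from the repository's gff_to_bed.py): only `gene`
features are kept, and the interval name comes from the GFF attributes.
"""
import re
from typing import Optional

def gff_line_to_bed(line: str, keep_types=("gene",)) -> Optional[str]:
    """Return a BED line for one GFF3 line, or None if it is skipped."""
    if line.startswith("#") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    chrom, ftype, start, end = cols[0], cols[2], cols[3], cols[4]
    if ftype not in keep_types:
        return None
    # GFF3 is 1-based inclusive; BED is 0-based half-open.
    m = re.search(r"(?:Name|locus_tag)=([^;]+)", cols[8])
    name = m.group(1) if m else "."
    return f"{chrom}\t{int(start) - 1}\t{end}\t{name}"

if __name__ == "__main__":
    gff = ("NC_000962.3\tRefSeq\tgene\t2288681\t2289241\t.\t-\t.\t"
           "ID=gene-Rv2043c;Name=pncA")
    print(gff_line_to_bed(gff))
```

The resulting three-plus-one-column BED is the shape mosdepth expects after `--by`.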

However, I must say that I think replicating our entire bioinformatic process from the current state of the GitHub repository is a very ambitious endeavor that would require a lot of work.

On the other hand, we have re-implemented our bioinformatic workflow using Terraform so that it can be readily redeployed to any new environment (although it is based solely on AWS resources). I'm afraid we haven't open-sourced this solution for the moment, but would it be something you would be interested in if that were the case?

muguading commented 4 weeks ago

I am delighted to receive your reply, and I have seen the scripts you uploaded; I really appreciate it. The reason I need to set up the process on my end is that I have a large number of samples to analyze. Additionally, due to the confidentiality of the sample data, they are not suitable for upload to a cloud server for analysis.

muguading commented 3 weeks ago

Hello, I hope this message finds you well. Following the workflow and scripts you provided earlier, I have generated mutation CSV files from freebayes, bcftools, and GATK, as well as structural variant CSV files identified by delly. Could you please confirm whether these are the final products of the JSON workflow?

The README further mentions that VCF files need to be annotated with snpEff and then converted using load_annotation_variants.py. I am unsure how to integrate the CSV files I currently have with the subsequent steps in the workflow.

I would greatly appreciate your guidance on how to proceed. Thank you very much for your assistance.

sachalau commented 2 weeks ago

Dear @muguading, sorry again for the delay in answering.

As mentioned in the catalogue methods, we use an intermediary step where we load our variants into a Postgres database after we have genotyped our samples. You can find our database layout in the respective folder of the repository. So we have logic, not currently available in the repository, that loads these CSVs into our database. However, that shouldn't prevent you from annotating the variants, as you only need to extract the variant coordinates (#CHROM, POS, REF, ALT of the VCF) to annotate with SnpEff.
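If it helps, turning those four extracted columns back into a minimal VCF that SnpEff will accept could look roughly like this. It's a sketch under assumptions: the source of the (chrom, pos, ref, alt) tuples is your own CSV parsing, and the example coordinate is illustrative.

```python
"""Sketch: render (chrom, pos, ref, alt) tuples as a minimal VCF for SnpEff.

The tuples would come from the pipeline's genotyping CSVs; the CSV column
layout is not specified here and is left to the caller.
"""

VCF_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)

def variants_to_vcf(variants):
    """Deduplicate, sort, and render variant tuples as minimal VCF text."""
    records = [f"{c}\t{p}\t.\t{r}\t{a}\t.\t.\t."
               for c, p, r, a in sorted(set(variants))]
    return VCF_HEADER + "\n".join(records) + "\n"

if __name__ == "__main__":
    # Illustrative variant on the H37Rv chromosome.
    print(variants_to_vcf([("NC_000962.3", 761155, "C", "T")]), end="")
```

SnpEff only needs these fixed fields to annotate, so sample/genotype columns can be dropped entirely at this stage.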

Once this is done, we extract only the new variants and run them through SnpEff. After they have been run through SnpEff, a Python script corrects some systematic errors we have identified in SnpEff's output and loads the new annotations into our database. You can find that script in the repository as well, and the SnpEff command that we used in the main README.

muguading commented 2 weeks ago

Hello,

I'm glad to receive your response.

I understand how to use SnpEff to annotate the VCF files and then use your Python script for the subsequent analysis. However, I have a few questions regarding the VCF files generated by bcftools, GATK, and freebayes:

  1. How should I merge the VCF files generated by these three tools?
  2. Or should I simply take the union of the three VCF files for the subsequent resistance annotation?

Thank you!

sachalau commented 1 week ago

Hi @muguading. Although we used three different genotypers, for most purposes, including resistance calling, we used only the outputs of freebayes. However, you need to correctly handle the multi-nucleotide variants that freebayes outputs, where several consecutive substitutions are reported as a single record. If you are not sure that your logic will handle these cases correctly, then you should probably use the outputs of bcftools.
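To illustrate the kind of handling meant here: a freebayes MNV such as REF=TC ALT=CT at one position covers two catalogue positions at once, so it has to be decomposed (in practice usually with a tool like vcflib's vcfallelicprimitives or bcftools norm) before position-based lookups. A minimal sketch of the equal-length substitution case only, with illustrative coordinates:

```python
"""Sketch: split a freebayes MNV (equal-length REF/ALT) into per-base SNVs.

Real pipelines typically delegate this to vcfallelicprimitives or
bcftools norm; this minimal version handles only the equal-length
substitution case, not indels or complex alleles.
"""

def decompose_mnv(chrom, pos, ref, alt):
    """Yield (chrom, pos, ref_base, alt_base) for each differing base."""
    if len(ref) != len(alt):
        raise ValueError("only equal-length REF/ALT substitutions are handled")
    for offset, (r, a) in enumerate(zip(ref, alt)):
        if r != a:
            yield (chrom, pos + offset, r, a)

if __name__ == "__main__":
    # One MNV record becomes two single-base variants at consecutive positions.
    print(list(decompose_mnv("NC_000962.3", 761154, "TC", "CT")))
```

Without this step, a catalogue keyed on single-base coordinates would silently miss resistance variants that freebayes happened to emit inside an MNV.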