bihealth / snappy-pipeline

SNAPPY Nucleic Acid Processing in Python
MIT License

Complete basic use case end-to-end test #10

Open holtgrewe opened 3 years ago

holtgrewe commented 3 years ago

Is your feature request related to a problem? Please describe. We need to create a minimal germline variant calling pipeline use case setup here.

Describe the solution you'd like We should have the following steps

for two or three public germline exome data sets, e.g., from IGSR/Thousand Genomes. We should include the famous NA12878 and otherwise just use whatever is available. We need to get this working end-to-end first and can refine afterwards (refinement is out of scope). Data and static data should either be downloaded directly from IGSR or be deposited on our public file servers.

Describe alternatives you've considered N/A

Additional context N/A

eudesbarbosa commented 3 years ago

The first working draft is now available at https://github.com/bihealth/snappy-use-case-germline

Analysis dir tree:

.
├── ngs_mapping
│   ├── config.yaml
│   ├── pipeline_job.sh
│   └── slurm_log
├── raw_data
│   └── NA12878
│       ├── NIST7035_TAAGGCGA_L002_R1_001_trimmed.fastq.gz
│       └── NIST7035_TAAGGCGA_L002_R2_001_trimmed.fastq.gz
├── static_data
│   ├── nexterarapidcapture_expandedexome_targetedregions.bed
│   └── nexterarapidcapture_expandedexome_targetedregions.bed.gz
├── variant_calling
│   ├── config.yaml
│   ├── pipeline_job.sh
│   └── slurm_log
└── variant_export
    ├── config.yaml
    ├── pipeline_job.sh
    └── slurm_log

Note: raw_data and static_data are downloaded on the fly.

I still need to update the README, but it follows the same logic as any other project. It needs the miniconda3 directory to be accessible one or two levels above, and once the files have been downloaded you can run:

cd GRCh37
cubi-tk snappy kickoff

Issues

  1. The submodule might not be necessary. I can move the relevant parts to the config, same as for the bed file.
  2. I still haven't included any tests or expected results. My initial idea was to compare with the results available in the same source(*), but the GATK version and parameters are completely different.

holtgrewe commented 3 years ago

@eudesbarbosa nice work 👍

Paths on NIH servers tend to change, so I would like to see an option (or commented-out lines) for a boiled-down dataset, generated by cutting the genes TTN, BRAF, KRAS, OMA1, and TGDS out of the aligned BAM file and then converting the reads back into FASTQ using bedtools bamtofastq while discarding single-end reads. These files should be added to our repository with Git LFS. I'd also like us to add BED files limited to the gene regions mentioned above to the repository. Ideally, the cutting process is documented with Bash snippets in a README file.
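A minimal sketch of the cutting step described above, written as a small Python helper that only assembles the command lines rather than running them. The file names (`NA12878.bam`, `gene_panel.bed`, the `giab_gene_panel` prefix) are placeholders chosen for illustration, not paths from the repository, and the BED file is assumed to already contain the regions of the five genes.

```python
import shlex


def build_subset_commands(bam, regions_bed, out_prefix):
    """Assemble commands that cut regions out of a BAM and convert back to FASTQ.

    All paths are hypothetical placeholders supplied by the caller.
    """
    subset_bam = f"{out_prefix}.subset.bam"
    by_name_bam = f"{out_prefix}.byname.bam"
    return [
        # Keep only reads flagged as paired (-f 1) that overlap the BED regions (-L).
        ["samtools", "view", "-b", "-f", "1", "-L", regions_bed,
         "-o", subset_bam, bam],
        # bedtools bamtofastq expects mates next to each other, so sort by read name.
        ["samtools", "sort", "-n", "-o", by_name_bam, subset_bam],
        # Write first and second mates to separate, uncompressed FASTQ files.
        ["bedtools", "bamtofastq", "-i", by_name_bam,
         "-fq", f"{out_prefix}_R1.fastq", "-fq2", f"{out_prefix}_R2.fastq"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be pasted into a README as Bash snippets.
    for cmd in build_subset_commands("NA12878.bam", "gene_panel.bed", "giab_gene_panel"):
        print(shlex.join(cmd))
```

The resulting FASTQ files would still need to be compressed (e.g. with gzip) before being added with Git LFS. Reads whose mate lies outside the selected regions lose their partner during the cut, so the read counts in R1 and R2 are worth checking after conversion.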

eudesbarbosa commented 3 years ago

Updated the repo following your suggestions. It seems to work, but I will have to choose another example: the current one was randomly selected and has no variants in the genes you suggested. The best I got were two variants that look like artefacts.
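One way to check a candidate sample for variants in the panel genes is to count VCF records per BED region. The sketch below is illustrative only: it assumes a tab-separated BED file with the gene name in the fourth column and an uncompressed VCF, and the function names are made up for this example rather than taken from the pipeline.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end, name) tuples."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, *rest = line.rstrip("\n").split("\t")
        # Fall back to the coordinates when no name column is present.
        name = rest[0] if rest else f"{chrom}:{start}-{end}"
        regions.append((chrom, int(start), int(end), name))
    return regions


def count_variants_per_region(vcf_lines, regions):
    """Count VCF records falling into each named region.

    Regions sharing a name (e.g. exons of one gene) are summed together.
    """
    counts = {name: 0 for (_, _, _, name) in regions}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos0 = int(pos) - 1  # VCF positions are 1-based, BED is 0-based half-open
        for r_chrom, start, end, name in regions:
            if chrom == r_chrom and start <= pos0 < end:
                counts[name] += 1
    return counts
```

With real files, both functions can be fed open file handles directly, e.g. `count_variants_per_region(open(vcf_path), parse_bed(open(bed_path)))`, where both paths are placeholders. Genes with a count of zero would indicate that the sample is a poor choice for the boiled-down test set.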

New structure:

.
├── ngs_mapping
│   ├── config.yaml
│   ├── pipeline_job.sh
│   └── slurm_log
├── raw_data
│   └── NA12878
│       ├── giab_gene_panel_R1.fastq.gz
│       └── giab_gene_panel_R2.fastq.gz
├── static_data
│   └── gene_panel_exomes.bed
├── variant_calling
│   ├── config.yaml
│   ├── pipeline_job.sh
│   └── slurm_log
└── variant_export
    ├── config.yaml
    ├── pipeline_job.sh
    └── slurm_log