WGLab / NanoCaller

Variant calling tool for long-read sequencing data
MIT License
98 stars 8 forks source link

info required on duplicate reads #11

Closed tahashmi closed 3 years ago

tahashmi commented 3 years ago

Hi Umair,

I hope you are doing well! I am working on cluster scale acceleration of different neural networks based (like DeepVariant) variant calling complete workflows (BWA/minimap2 for alignment, sorting and duplicates removal stages). I am using PySpark to leverage the benefits of Apache Arrow in-memory columnar data format. I want to integrate NanoCaller in my workflow as well. I have couple of questions regarding this.

For PacBio CCS data: 1). You mentioned "PacBio CCS alignment files for HG001 are downloaded from the GIAB database [30, 34]" Do you mean this ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/ dataset? If yes, did you align this dataset yourself or just used BAM files. I want to know if duplicate removal step is essential for this dataset or this is PCR-free.

2). Does PCR-free FASTQ dataset means no need of running duplicates removal application on it (just for my info.)?

3). Does PacBio CCS reads available on precisionFDA challengev2 website needs duplicates removal stage?

Thanks!

umahsn commented 3 years ago

Hi Tanveer,

Thank you for your interest in NanoCaller.

For HG001 (NA12878), we used aligned CCS reads from here: (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/PacBio_SequelII_CCS_11kb/HG001_GRCh38/) so we didn't do any alignment ourselves. For HG002-HG004, we obtained PacBio CCS FASTQ files from precisionFDA (now available at https://data.nist.gov/pdr/od/id/mds2-2336). We aligned these reads with minimap2 without any duplicate removal. Generally, for long-reads sequencing, you do not need to do any duplicate removal, regardless of PCR-free or not.

You can consult the supplementary file of precisionFDA challenge preprint to see what different teams did for alignment and any other pre-processing steps. Since we did not sequence these genomes, we do not if anything else was done before the FASTQ files were released publicly, so it would be better to consult with the sequencing team about that.

Just FYI, we used the following minimap2 command for CCS reads, with settings used in several aligned files provided by GIAB (example): minimap2 -a -x map-pb -k 19 -O 5,56 -E 4,1 -B 5 -z 400,50 -r 2k --eqx --secondary=no

Let me know if there are any further questions.

tahashmi commented 3 years ago

Thank you so much Umair for your quick and detailed reply. It's clear now.

Would you like to mention a short flow of tools in ensemble mode (like mentioned here) for your precisionFDA challenge submission? I want to know if I can use PySaprk dataframes for SAM data to integrate such ensemble mode variant calling workflow. Thanks.

umahsn commented 3 years ago

For our precisionFDA submission we had a somewhat elementary ensemble. We ran NanoCaller, Clair and Medaka separately on BAM files. After variant calling we took variant quality scores (QUAL field) of predictions made by each of the three variant callers and scaled them so that were all on a uniform scale. The reason for this is most of Clair's quality score were mostly in 500-1000, Medaka's score were mostly in 0-25, and NanoCaller's quality scores were mostly in 30-400 range.

We performed a weighted vote for each variant call, instead of majority voting. More precisely, we looked at each variant call in the union of three tools, issued the same call in final output with the following quality score:

QUAL=(NanoCaller QUAL score for the variant +Clair QUAL score for the variant +Medaka QUAL score for the variant)/3

If a tool did not make the variant call in union set we gave a quality score of 0 for that variant caller. So if there was a variant call made by NanoCaller with QUAL=240, but Medaka and Clair did not make this variant call, then the ensemble will issue the same variant call with QUAL=(214+0+0)/3=80.

So instead of removing calls with low number of votes, we just gave it a lower score. The reason for this is that there are several variant calls that are uniquely called by only one or two methods, so it would not seem prudent to discard such calls. This allows us maximum recall, but with quality score filtering we can make the callset more stringent to remove low vote supporting calls.

tahashmi commented 3 years ago

Thanks Umair for this detailed answer. Is it possible to contact you on your email for a possible collaboration? Thanks.

umahsn commented 3 years ago

Hi Tanveer, yes that would be great. Could you give me your email so I can copy the rest of my team in the email as well?

tahashmi commented 3 years ago

Sure Umair. This is my email id: t.ahmad@tudelft.nl