bcgsc / ntEdit

✏️ Genome assembly polishing & SNV detection
GNU General Public License v3.0
64 stars 9 forks source link

No IUPAC corrections after -s 1 mode #18

Closed sihellem closed 3 years ago

sihellem commented 3 years ago

Hi,

I have been trying ntEdit to polish a draft genome produced by SPAdes and including IUPAC to account for heterozygosity (I need this information in order to filter true generational SNPs later on) as I saw this feature was recently added to ntEdit.

However, none of the bases were corrected to IUPAC encodings.

Here are the commands and versions I used (reads are Illumina PE 250, coverage above 30X):

ntHits/2.7
ntEdit/1.3.1
nthits -b 36 --kmer=40 --threads=40 --outbloom --solid @reads.in
ntedit -s 1 -t 40 -f $ASSEMBLY -r solids_k40.bf

Any idea why that is?

Best, Simon

warrenlr commented 3 years ago

Hi Simon,

Thank you for your message and interest in ntedit.

I see that you are using an older (v1.3.1) version of ntedit. Please note that IUPAC support was introduced in ntEdit v1.3.3. But I recommend you clone the latest (v1.3.4) version as there are bug fixes to the vcf output that are not yet released.

Note: In -s 1 mode (SNV mode), each base is assessed for alternate, including IUPAC characters. The main difference between the polishing and -s 1 mode when it comes to IUPAC handling is that the former will look for possible ALTERNATE bases to IUPAC, where in -s 1 mode, all bases will be assessed.

I hope this helps & thank you for using ntedit!

Happy holidays! Rene

sihellem commented 3 years ago

Hi René,

Thank you for your prompt answer.

Sorry, it seems I made a typo, I am actually using ntEdit/1.3.4.

If I understand correctly your note, all individual alternate for the same position should be found in the .vcf file, right? While in the .fa file, I would find IUPAC encoding if at least two characters were found at the same position. Therefore, any idea why I don't?

Happy holidays to you too, Simon

warrenlr commented 3 years ago
all individual alternate for the same position should be found in the .vcf file, right? 

Only alternate bases that pass the -y (or -Y) filter will be recorded in the vcf file.

While in the .fa file, I would find IUPAC encoding if at least two characters were found at the same position

unfortunately no, this support has not been implemented. The most supported base will be chosen and if two or more have equal support at a given position, the first base reported is chosen.

warrenlr commented 3 years ago

The IUPAC support is limited at this point. Before v1.3.4, characters other than ACGT were ignored. Now these characters are considered for polishing/SNV detection.

We implemented this feature to enable users to quickly evaluate SNPs at IUPAC positions using a wide range or sequencing datasets (with the vcf being the most useful output for this assessment)

sihellem commented 3 years ago

Hi René,

Thanks for these clarifications.

I figured out that I could use the .vcf to include IUPAC coding into the draft assembly in this fashion using bcftools/1.9:

bgzip -c $VCF > $VCF_GZ
tabix -p vcf $VCF_GZ
bcftools consensus --iupac-codes --fasta-ref $ASSEMBLY $VCF_GZ > out.fa

This works only using the initial assembly, not the ntedit-polished version.

I guess I could use a first round of nthit-ntedit on the draft assembly, pass the pipeline a second time on this ntedit-polished version, and then correct it to IUPAC encodings with the above code; thereby still leveraging the rapidity of ntedit to polish.

Would you see any problem with this approach?

warrenlr commented 3 years ago

Hi Simon,

Yes, I think it makes perfect sense.

Additional notes for your consideration: ntedit will only preserve IUPAC (and Ns, since they are not assessed) only if no suitable base replacements are found, which in -s 1 mode I doubt would be the case very often (since all possible bases considered). Also, the -s 1 mode is fairly experimental, and indels aren't considered (so the ntedit output FASTA sequence will be the exact same sequence as that of the input draft and thus the same base coordinates as the input draft -- an important consideration for your IUPAC assignment I think).

Cheers! René

warrenlr commented 3 years ago

I believe it is now safe to close this ticket. Please comment here for related questions, and/or open a new issue as needed. Don't forget to "star" us if you like ntedit and find it useful in your research ;).

Best, René