Trimming of ends affects PIS sequence number in pi_pos_all.fasta

vinitamehlawat commented 2 years ago

Hi @qianjiaqiang

I have a small query regarding the effect of cleaning the data (trimming of ends):

I aligned my sequences with two different methods and ran VENAS both on the raw alignments and on the alignments with the ends trimmed. For both methods, the results were quite different.

Without trimming the ends I get fewer PIS sequences in pi_pos_all.fasta, BUT after trimming the ends the number of sequences in pi_pos_all.fasta is considerably larger.

Please suggest which method I should follow for VENAS: the raw aligned data, OR the aligned data with cleaned ends?

Thank you Vinita

Reilly-cao commented 2 years ago

Hi Vinita @vinitamehlawat, the purpose of sequence alignment is to bring all sequences to the same positions and length. In that case, the trimmed sequences should give fewer PIS sequences than the untrimmed ones. What method did you use for the multiple sequence alignment? Were all sequences aligned to the reference sequence?

vinitamehlawat commented 2 years ago

I used Nextalign and the online version of MAFFT.

In Nextalign I removed the first 50 bp from the 5' end and ~150 bp from the 3' end.

In MAFFT I removed the first 150 bp from the 5' end and ~3 kb from the 3' end.

Reilly-cao commented 2 years ago

parsimony-informative.py applies a two-step filter.

The -r parameter ensures that every PIS with a frequency greater than the -r value is an A, T, C, or G base. If a sequence's base at such a position is not A, T, C, or G (e.g. it is N), that sequence is screened out. So the more Ns a sequence contains, the more likely it is to be filtered. Once the sequences are trimmed, especially at the ends, a sequence is less likely to contain N at a PIS position and is therefore less likely to be screened out.

In addition, the -b parameter is a filter for valid base sites. With the default parameters, a PIS is considered valid if the number of unambiguous bases (A, T, C, G) is ≥80% of the total genome.

We recommend using the raw aligned data; parsimony-informative.py will also trim the sequences of the multiple sequence alignment. If you want to keep more sequences, you can adjust the parameters. If -r=1 and -b=0, all sequences will be retained.
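To make the effect of trimming concrete, here is a minimal Python sketch of the sequence-screening idea described above. It is not the code of parsimony-informative.py: the function names `screen_sequences` and `trim_ends`, the toy alignment, and the chosen PIS positions are all hypothetical, and the -r and -b thresholds are deliberately left out. It only shows that a sequence with N at a PIS position is dropped, and that trimming N-rich ends lets more sequences through.

```python
UNAMBIGUOUS = set("ACGT")


def screen_sequences(alignment, pis_positions):
    """Keep only sequences whose base at every given position is A, C, G or T.

    Simplified illustration of the screening step; positions are 0-based.
    """
    kept = {}
    for name, seq in alignment.items():
        if all(seq[pos].upper() in UNAMBIGUOUS for pos in pis_positions):
            kept[name] = seq
    return kept


def trim_ends(alignment, left, right):
    """Remove `left` columns from the 5' end and `right` columns from the 3' end."""
    return {name: seq[left:len(seq) - right] for name, seq in alignment.items()}


# Toy alignment (made up): seq2 and seq3 carry N only at their ends.
raw = {
    "seq1": "ACGTACGTAC",
    "seq2": "NNGTACGTAC",
    "seq3": "ACGTACGTNN",
    "seq4": "TCGAACGAAT",
}

# Assumed PIS positions in the raw alignment, including the two end columns.
raw_pis = [0, 3, 9]
kept_raw = screen_sequences(raw, raw_pis)

# After trimming 2 columns from each end, only interior positions remain.
trimmed = trim_ends(raw, 2, 2)
trimmed_pis = [1, 5]  # raw columns 3 and 7 in trimmed coordinates
kept_trimmed = screen_sequences(trimmed, trimmed_pis)

print(len(kept_raw), "sequences kept without trimming")
print(len(kept_trimmed), "sequences kept after trimming")
```

In this toy case the raw alignment keeps two sequences (seq2 and seq3 are dropped because of N at the end columns), while the trimmed alignment keeps all four, which mirrors the larger pi_pos_all.fasta observed after trimming.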

vinitamehlawat commented 2 years ago

Thank you @Reilly-cao

This helps a lot!

Vinita
