AlexKnyshov / alibaseq

Alignment-Based Sequence extraction
MIT License
13 stars 5 forks source link

How to eliminate intron sequences in the output files? #5

Closed sanvva closed 2 years ago

sanvva commented 2 years ago

`for f in assemblies/*.fasta do makeblastdb -in $f -dbtype nucl -parse_seqids done

ls assemblies/*.fasta | rev | cut -f1 -d/ | rev > list_of_files_to_seach_against.txt

../blast_wrapper.sh ./baits/ ./assemblies/ 1e-10 tblastn 40 n ./list_of_files_to_seach_against.txt

mkdir blast_results

mv *.blast blast_results

python ../alibaseqPy3.py -x a -f M -b ./blast_results/ -t ./assemblies/ -e 1e-10 --is --amalgamate-hits --ac aa-tdna -o alibaseq_oput_x-a -s out --log-header`

In the output folder "alibaseq_oput_x-a", some gene.fasta have intron sequences, resulting problems in sequence alignment. Baits are protein sequences.

image

The real alignment is this:

image

So, how to completely exclude all intron sequences?

AlexKnyshov commented 2 years ago

Hi,

OK. Would it be possible for you to provide one of the log files, maybe for one of the samples that still has the intron in the output?

sanvva commented 2 years ago

Hi,

OK. Would it be possible for you to provide one of the log files, maybe for one of the samples that still has the intron in the output? alibaseq_default.log

This four assemblies all have intron.

AlexKnyshov commented 2 years ago

Thank you, however, this file only contains a generic information. I would need to see a sample log file _.log (so something like ZL144_k41.scafSeq.fasta_default.log in your case) to see the details. Would you happen to know the original intron size? In my evaluations, TBLASTN can occasionally match parts of introns to bait exon AA sequence, by extending HSP into introns, and sometimes this extension can be over 50bp. If this happens, our script program can not really remove the introns completely. So either blast parameters need to be adjusted, or some intron aware aligner such as exonerate needs to be used after the fact to predict and trim out intronic leftovers...

sanvva commented 2 years ago

Thank you, however, this file only contains a generic information. I would need to see a sample log file _.log (so something like ZL144_k41.scafSeq.fasta_default.log in your case) to see the details. Would you happen to know the original intron size? In my evaluations, TBLASTN can occasionally match parts of introns to bait exon AA sequence, by extending HSP into introns, and sometimes this extension can be over 50bp. If this happens, our script program can not really remove the introns completely. So either blast parameters need to be adjusted, or some intron aware aligner such as exonerate needs to be used after the fact to predict and trim out intronic leftovers... ZL147_k41.scafSeq.fasta.blast_default.log.zip

Here is the log file of sample ZL147. My baits files are results of OrthoFinder from 3 plant genome annotation proteins. I also find that, short introns are often kept in the alibaseq output.

AlexKnyshov commented 2 years ago

Sorry for my late response. Could you tell me the ID /name of the particular locus that is on your screenshots? The log file you sent me has the entire set of loci processed. In general, I think, what you observing is a known problem when parts of the bait match beginnings/ ends of introns. Sadly, blast in not accurate enough to precisely delimit introns, and in such cases alibaseq's output would contain some leftover intronic sequences. You can try making blast search more stringent (Based on the picture you sent, it looks like the alignment is very conserved), for example, by setting -gapextend parameter to more than (default) of 1. See section 8.14 here https://books.google.com/books?id=zDqksXUaoBAC&lpg=PT144&ots=L2BBpbynYO&dq=8.14%20Consider%20Using%20Ungapped%20Alignment%20for%20BLASTX%2C%20TBLASTN%2C%20and%20TBLASTX&hl=ru&pg=PT144#v=onepage&q&f=false

AlexKnyshov commented 2 years ago

If you wish to experiment with gapextend and are using the blast_wrapper.sh wrapper, you would have to manually edit lines 64 and 105 to paste in -gapextend 2 at the end of the lines. The default is 1, so 2 or more should hopefully improve the situation

sanvva commented 2 years ago

Sorry for my late response. Could you tell me the ID /name of the particular locus that is on your screenshots? The log file you sent me has the entire set of loci processed.

9094.1473

sanvva commented 2 years ago

If you wish to experiment with gapextend and are using the blast_wrapper.sh wrapper, you would have to manually edit lines 64 and 105 to paste in -gapextend 2 at the end of the lines. The default is 1, so 2 or more should hopefully improve the situation

Thanks, I'll try it.

AlexKnyshov commented 2 years ago

I examined the logs concerning this locus and I believe the reason for extra intron in that locus is TBLASTN extension of matching region into intron as I speculated earlier. The logs actually indicate that this test run of yours was run in tdna-tdna mode instead of aa-tdna so protein coordinates were not parsed appropriately. Nevertheless, the issue still holds - TBLASTN identified 2 matched regions in the 9094.1473 bait to C44233266, one 1-98AA, and the other 66-306AA. There's a ~30AA overlap (~90bp) which prevented alibaseq from using both regions, and it selected higher scoring second region. That second region is likely the intron containing region, with 90bp spuriously aligning to the bait AA sequence upstream. So I believe increasing gapextend parameter of TBLASTN should limit spurious alignment extension, and also help to recover the other exon for that locus in that sample, since the overlap should be much smaller if even present.

sanvva commented 2 years ago

I examined the logs concerning this locus and I believe the reason for extra intron in that locus is TBLASTN extension of matching region into intron as I speculated earlier. The logs actually indicate that this test run of yours was run in tdna-tdna mode instead of aa-tdna so protein coordinates were not parsed appropriately. Nevertheless, the issue still holds - TBLASTN identified 2 matched regions in the 9094.1473 bait to C44233266, one 1-98AA, and the other 66-306AA. There's a ~30AA overlap (~90bp) which prevented alibaseq from using both regions, and it selected higher scoring second region. That second region is likely the intron containing region, with 90bp spuriously aligning to the bait AA sequence upstream. So I believe increasing gapextend parameter of TBLASTN should limit spurious alignment extension, and also help to recover the other exon for that locus in that sample, since the overlap should be much smaller if even present.

Yes, the 98bp intron region extracted of ZL147 is the end of the intron nearby exon 2. I've also run with "--ac aa-tdna", the results were same to "--ac tdna-tdna". I've tried to set "-gapextend 2", the result was same to the default (-gapextend 1). "-gapextend 4" error occurred:

BLAST query/options error: Gap existence and extension values of 11 and 4 not supported for BLOSUM62 supported values are: 32767, 32767 11, 2 10, 2 9, 2 8, 2 7, 2 6, 2 13, 1 12, 1 11, 1 So I set "-gapopen 32767 -gapextend 32767", the gene sequence extracted was 793bp, longer than "-gapextend 1or2"(723bp). I also extracted the C44233266 sequences(11881 bp) from ZL147 assembly. mRNA sequence of the bait genes which was extracted from annotated genome was also extracted. The annotated genome is one of the 3 genomes used to find baits by Orthofinder. After alignment, I found that "-gapextend 1or2" failed to extract exon 1, and "-gapopen 32767 -gapextend 32767" extracted the right exon 1 but wrong exon 2, no intron extracted. I checked the log file (attached below), the exon 2 was extracted from contig C43759570. Can I increase the e-value and chose "-gapopen 32767 -gapextend 32767"?

1653442078884 ZL147_gapopen32767_gapextend32767.log

AlexKnyshov commented 2 years ago

OK, I see. I can try help further, but I think I would need to see a standard blast output when searching that specific protein against that specific genome, could you provide that? Something similar to below, do not add outfmt 6: tblastn -query [file with the sequence 9094.1473] -db [file with the contig C44233266 / with ZL147 genome] -evalue 1e-10

sanvva commented 2 years ago

OK, I see. I can try help further, but I think I would need to see a standard blast output when searching that specific protein against that specific genome, could you provide that? Something similar to below, do not add outfmt 6: tblastn -query [file with the sequence 9094.1473] -db [file with the contig C44233266 / with ZL147 genome] -evalue 1e-10 9094.1473_to_C44233266_C43759570.txt 9094.1473_to_ZL147_genome.txt

sanvva commented 2 years ago

Can I increase the e-value and chose "-gapopen 32767 -gapextend 32767"?

There is no improvement when changing "1e-10" to "1-e15".

../blast_wrapper.sh ./baits/ ./assemblies/ 1e-15 tblastn 40 n ./list_of_files_to_seach_against.txt
python ../alibaseqPy3.py -x a -f M -b ./blast_results/ -t ./assemblies/ -e 1e-15 --is --amalgamate-hits --ac aa-tdna -o alibaseq_oput_x-a -s out --log-header`
AlexKnyshov commented 2 years ago

Thank you for providing that blast alignment, I should have asked for that from the beginning instead of guessing: blast I highlighted the problematic alignment produced by TBLASTN. Since alibaseq only works with final scores and ranges, it does not have a capability to adjust the aligned region, so if the best blast result is spurious, so will be the result of our program. This is an ungapped alignment extension, so changing gap options would not help. The extension is also a part of a larger HSP which itself on average is a very good match, with high e-value, so changing e-value would not help either. The only option I can suggest to try is changing the scoring matrix, set by -matrix flag. This issue of spurious alignments of BLAST has been covered elsewhere (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3834790/ or https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/ ), and in part due to protein blast scoring matrices being designed for divergent sequences. So you could try choosing the least distant matrix like PAM30, or choosing IDENTITY (so adding -matrix PAM30 or -matrix IDENTITY to the blast command instead of gapextend). Identity scoring is raw-distance based so should theoretically be more useful for closely related sequences of yours. I hope this helps, but if not, you might have to consider protein aligner options other than TBLASTN, or perhaps use CDS instead of protein as a bait and run a nucleotide blast search (since your organisms are so closely related, it might work well), or use exonerate to trim out spurious extensions from the results...

sanvva commented 2 years ago

Thank you for providing that blast alignment, I should have asked for that from the beginning instead of guessing: blast I highlighted the problematic alignment produced by TBLASTN. Since alibaseq only works with final scores and ranges, it does not have a capability to adjust the aligned region, so if the best blast result is spurious, so will be the result of our program. This is an ungapped alignment extension, so changing gap options would not help. The extension is also a part of a larger HSP which itself on average is a very good match, with high e-value, so changing e-value would not help either. The only option I can suggest to try is changing the scoring matrix, set by -matrix flag. This issue of spurious alignments of BLAST has been covered elsewhere (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3834790/ or https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/ ), and in part due to protein blast scoring matrices being designed for divergent sequences. So you could try choosing the least distant matrix like PAM30, or choosing IDENTITY (so adding -matrix PAM30 or -matrix IDENTITY to the blast command instead of gapextend). Identity scoring is raw-distance based so should theoretically be more useful for closely related sequences of yours. I hope this helps, but if not, you might have to consider protein aligner options other than TBLASTN, or perhaps use CDS instead of protein as a bait and run a nucleotide blast search (since your organisms are so closely related, it might work well), or use exonerate to trim out spurious extensions from the results...

Thanks you very much! Both matrix PAM30 and IDENTITY worked! IDENTITY got Ns in the exon2. So PAM30 is what I'm looking for. 1653532004011