AlexKnyshov / alibaseq

Alignment-Based Sequence extraction
MIT License
14 stars 5 forks source link

How to extract all exons of Ortholog Genes (OGs) from chromosome-level assemblies? #6

Closed sanvva closed 2 years ago

sanvva commented 2 years ago

I've got Single_Copy_Orthologue_Sequences from 3 annotated genomes(YYL, LBL, HTL, protein sequences) by using OrthoFinder2. I try to extract OGs from these three and another three genomes by AliBaSeq, but most of the extracted genes are patchy from exon to exon in all 6 genomes. So I chose 1 OG (comprised of 14 exons, 5085aa, 15255nt, attatched below) to modifiy the parameters by got no improvement. I've tried not adding "--is", using "PAM30" matrix, changing "--hit-ovlp 40" or "ctg-ovlp 40".

In the tblastn results, all six genome got hit in all 13 exons excepted the shortest one--exon2 (36aa).

How to get all exons of baits genes from 1 hit contig/scaffold?

`ls assemblies/*.fasta | rev | cut -f1 -d/ | rev > list_of_files_to_seach_against.txt

../blast_wrapper_matrix-PAM30.sh ./bait4039/ ./assemblies/ 1e-10 tblastn 30 n ./list_of_files_to_seach_against.txt

mkdir blast_results

mv *.blast blast_results

python ../alibaseqPy3.py -x a -f M -b ./blast_results/ -t ./assemblies/ -e 1e-10 --amalgamate-hits --ac aa-tdna -o alibaseq_out-no-stitcher --log-header -s no-stitcher-log`

6genomes_extracted.pdf 9094.4039.split.fas.zip blast_results LBL_ref_genome.fasta.blast_default.log YYL_ref_genome.fasta.blast_default.log

ericRLgordon commented 2 years ago

Try upping the evalue for tblastn to 1e-5 to get the short exon. Not sure about the difference in performance for the different genomes. What do you think @AlexKnyshov?

sanvva commented 2 years ago

Try upping the evalue for tblastn to 1e-5 to get the short exon. Not sure about the difference in performance for the different genomes. What do you think @AlexKnyshov?

The tblastn results are the same after increasing evalue from "1e-10" to "1e-5". Oops.

ericRLgordon commented 2 years ago

I see. Perhaps try hmmr? AliBaSeq won't be able to find anything that isn't already found in the input files. You could also try a more closely related query sequence.

Is the patchiness of the results mostly based on these very short exons?

sanvva commented 2 years ago

I see. Perhaps try hmmr? AliBaSeq won't be able to find anything that isn't already found in the input files. You could also try a more closely related query sequence.

Is the patchiness of the results mostly based on these very short exons?

Not always short exons, see 6genomes_extracted.pdf attached above.

The first 3 genomes (YYL, LBL, HTL) are the reference genomes where baits come from.

AlexKnyshov commented 2 years ago

I'm sorry I got back to you so late.

Let me know if you have any other issues or need more feedback

sanvva commented 2 years ago

@AlexKnyshov Thanks for your detailed explanations. I've tried to set "--hit-ovlp 100", the results were much better except one genome which had a 420aa query-overlap in tblastn result). "--hit-ovlp 10000" gave the same result, with 2 HSPs caused by 420aa overlap.

I also tried to use CDS baits, all exons were extracted!

Lastly, I'm no sure the using of "alignment coordinate" parameter (--ac). If using tblastn, then "--ac aa-tdna"; if using blastn, then "--ac dna-dna"?

AlexKnyshov commented 2 years ago

except one genome which had a 420aa query-overlap in tblastn result). "--hit-ovlp 10000" gave the same result, with 2 HSPs caused by 420aa overlap.

Another thing is when you use translated alignments ("tdna-aa", "tdna-tdna", or "aa-tdna"), alibaseq checks for frameshifts when combining HSP, and doesn't allow combining when there are shifts (by my logic, to be combined, overlapping regions of a given sequence should align to same reference in the same frame, otherwise it is really a different set of codons and a different protein). Because the overlaps in your case are technical in nature and are likely at least partially intronic, it is plausible that they are not divisible by 3. This could be the reason why even large hit-ovlp doesnt solve the issue of splitting for certain genes.

I could perhaps add a flag to skip this filtering, but currently not possible. In my previous tests on other types of data it introduced more bad than good, so was disallowed. You can try hacking this in alibaseq by using --ac dna-dna and see if it helps for that one case you mentioned, but I haven't actually tested such approach, maybe the coordinates and consequently the output gets messed up...

Yes, the --ac values correspond to query-target pair. So if query was protein, and target was DNA, which blast then internally translated, --ac should be ac-tdna. And if blastn, then dna-dna.

AlexKnyshov commented 2 years ago

had a 420aa query-overlap

Actually, this kind of blast overlap seems quite enormous. Perhaps there's some repetitive region in the query for that locus?

sanvva commented 2 years ago

You can try hacking this in alibaseq by using --ac dna-dna and see if it helps for that one case you mentioned

No, the results are the same as "--ac aa-tdna".

Actually, this kind of blast overlap seems quite enormous. Perhaps there's some repetitive region in the query for that locus?

tblastn_4039_GENOME_PRJCA005355

Yes, the annotation of this species (GENOME_PRJCA005355) has only the first 8 exons of all 14 exons. The 420aa overlap has two copies in the exon8. But the alibaseq output the exon9-exon14, missing the annotated exon1 to exon8. Censifolium_2repeat_420aa_overlap

9049 4039_hit-ovlp100000

Using cds baits got the first six exons of GENOME_PRJCA005355(without --is). blastn_cds_no-is

AlexKnyshov commented 2 years ago

OK, I checked this particular gene by downloading the genome and running alibaseq myself. So one thing I forgot to mention is that when you have such large and well aligned hits, e-value will be very high for all hits. This can throw off the HSP comparison algorithm.

You should try adding -m b-e-i to alibaseq's command and retrying. It then should extract the largest piece (568356 to 604714bp), not the part btw 559493 and 568317. This order forces alibaseq to first use bitscore to score alignments and only if that is tie, try evalue, then identity.

The default comparison metric is e/b-i, when it compares difference in evalue and bit and if they are contradictory, it checks identity. This works well for small baits and hybrid capture style contigs, but not so well on full size proteins and large genomic scaffolds, sometimes smaller regions would have higher identity...

However, this only impacts selection which of the two pieces are getting extracted first, it still doesn't merge them.

I found alibaseq refuses to merge this because bait 3410-3569 aligns to scaffold in region 567847-568317, while larger part of the same bait region (2456-3833) aligns to completely different non-overlapping scaffold region 569011-573252. So essentially as I hypothesized before, there's some "repetitive" region with a bit lower identity. Currently alibaseq is forbidden to merge such regions together, because I was afraid it might not do it accurately, and also for phylogenetics loci are typically desired to be devoid of that kind of sequence...

sanvva commented 2 years ago

Yeah! "-m b-e-i --hit-ovlp 100" seems the best option for protein baits when BLAST to chromosome-level assemblies. Thanks you so much! @AlexKnyshov