How to extract all exons of Ortholog Genes (OGs) from chromosome-level assemblies?

sanvva commented 2 years ago

I've got Single_Copy_Orthologue_Sequences from 3 annotated genomes(YYL, LBL, HTL, protein sequences) by using OrthoFinder2. I try to extract OGs from these three and another three genomes by AliBaSeq, but most of the extracted genes are patchy from exon to exon in all 6 genomes. So I chose 1 OG (comprised of 14 exons, 5085aa, 15255nt, attatched below) to modifiy the parameters by got no improvement. I've tried not adding "--is", using "PAM30" matrix, changing "--hit-ovlp 40" or "ctg-ovlp 40".

In the tblastn results, all six genome got hit in all 13 exons excepted the shortest one--exon2 (36aa).

How to get all exons of baits genes from 1 hit contig/scaffold?

`ls assemblies/*.fasta | rev | cut -f1 -d/ | rev > list_of_files_to_seach_against.txt

../blast_wrapper_matrix-PAM30.sh ./bait4039/ ./assemblies/ 1e-10 tblastn 30 n ./list_of_files_to_seach_against.txt

mkdir blast_results

mv *.blast blast_results

python ../alibaseqPy3.py -x a -f M -b ./blast_results/ -t ./assemblies/ -e 1e-10 --amalgamate-hits --ac aa-tdna -o alibaseq_out-no-stitcher --log-header -s no-stitcher-log`

6genomes_extracted.pdf 9094.4039.split.fas.zip blast_results LBL_ref_genome.fasta.blast_default.log YYL_ref_genome.fasta.blast_default.log

ericRLgordon commented 2 years ago

Try upping the evalue for tblastn to 1e-5 to get the short exon. Not sure about the difference in performance for the different genomes. What do you think @AlexKnyshov?

sanvva commented 2 years ago

Try upping the evalue for tblastn to 1e-5 to get the short exon. Not sure about the difference in performance for the different genomes. What do you think @AlexKnyshov?

The tblastn results are the same after increasing evalue from "1e-10" to "1e-5". Oops.

ericRLgordon commented 2 years ago

I see. Perhaps try hmmr? AliBaSeq won't be able to find anything that isn't already found in the input files. You could also try a more closely related query sequence.

Is the patchiness of the results mostly based on these very short exons?

sanvva commented 2 years ago

I see. Perhaps try hmmr? AliBaSeq won't be able to find anything that isn't already found in the input files. You could also try a more closely related query sequence.

Is the patchiness of the results mostly based on these very short exons?

Not always short exons, see 6genomes_extracted.pdf attached above.

The first 3 genomes (YYL, LBL, HTL) are the reference genomes where baits come from.

AlexKnyshov commented 2 years ago

I'm sorry I got back to you so late.

So, as to why in general they are patchy despite blast hits are there - judging by log of LBL, the 11 HSPs found were split into 2 pseudo contigs, instead of being kept as 1, and then only one out of two (the longest) was selected for output. The reason for split is probably the overlap. See rows 3 and 4 of your LBL table, row 3 ends at 2207AA, but row 4 starts at 2179AA, same part of query matched to two different places. Looking at your image, several extracted sequences end or start btw exons 4 and 5, intron there is super short. This consistent spurious match is a problem of TBLASTN. Two ways around this problem - enlarge --hit-ovlp parameter to the amount greater than the overlap (in this case seems like 100 would work - it is expressed as nucleotides so AA x3), but at the expense you will get this extra ~100bp likely intronic or duplicated sequence, and will have to trim it out later, possibly using exonerate, alignment trimmer or other tool. Or otherwise, use a different aligner, perhaps one that has more sophisticated intron models, or in general with less false positive matches.
the other problem (not all exons detected) - if HSP is absent from alibaseq's table, it can't do anything about it. I blasted your LBL seq, here's the pic I bet this softmasking of query that you can see around 60-90AA as lowcase letters is what causing the issue. When continuous proteins are aligned it's no biggie, but when alignment of a given exon is performed and low complexity region occupies half of it, the HSP is probably not reported. You can probably turn it off in the blast parameters, but again at the expense of some potential spurious matches, see here for example https://www.biostars.org/p/210569/
if you have highly contiguous assemblies, all your gene exons should be on one contig, so enabling or disabling --is or tweaking --ctg-ovlp will have no impact. In such cases common suspects for no reporting are indeed low scores or same contig overlaps of different HSPs.

Let me know if you have any other issues or need more feedback

sanvva commented 2 years ago

@AlexKnyshov Thanks for your detailed explanations. I've tried to set "--hit-ovlp 100"， the results were much better except one genome which had a 420aa query-overlap in tblastn result). "--hit-ovlp 10000" gave the same result, with 2 HSPs caused by 420aa overlap.

I also tried to use CDS baits, all exons were extracted!

Lastly, I'm no sure the using of "alignment coordinate" parameter (--ac). If using tblastn, then "--ac aa-tdna"; if using blastn, then "--ac dna-dna"?

AlexKnyshov commented 2 years ago

except one genome which had a 420aa query-overlap in tblastn result). "--hit-ovlp 10000" gave the same result, with 2 HSPs caused by 420aa overlap.

Another thing is when you use translated alignments ("tdna-aa", "tdna-tdna", or "aa-tdna"), alibaseq checks for frameshifts when combining HSP, and doesn't allow combining when there are shifts (by my logic, to be combined, overlapping regions of a given sequence should align to same reference in the same frame, otherwise it is really a different set of codons and a different protein). Because the overlaps in your case are technical in nature and are likely at least partially intronic, it is plausible that they are not divisible by 3. This could be the reason why even large hit-ovlp doesnt solve the issue of splitting for certain genes.

I could perhaps add a flag to skip this filtering, but currently not possible. In my previous tests on other types of data it introduced more bad than good, so was disallowed. You can try hacking this in alibaseq by using --ac dna-dna and see if it helps for that one case you mentioned, but I haven't actually tested such approach, maybe the coordinates and consequently the output gets messed up...

Yes, the --ac values correspond to query-target pair. So if query was protein, and target was DNA, which blast then internally translated, --ac should be ac-tdna. And if blastn, then dna-dna.

AlexKnyshov commented 2 years ago

had a 420aa query-overlap

Actually, this kind of blast overlap seems quite enormous. Perhaps there's some repetitive region in the query for that locus?

sanvva commented 2 years ago

You can try hacking this in alibaseq by using --ac dna-dna and see if it helps for that one case you mentioned

No, the results are the same as "--ac aa-tdna".

Actually, this kind of blast overlap seems quite enormous. Perhaps there's some repetitive region in the query for that locus?

tblastn_4039_GENOME_PRJCA005355

Yes, the annotation of this species (GENOME_PRJCA005355) has only the first 8 exons of all 14 exons. The 420aa overlap has two copies in the exon8. But the alibaseq output the exon9-exon14, missing the annotated exon1 to exon8. Censifolium_2repeat_420aa_overlap

9049 4039_hit-ovlp100000

Using cds baits got the first six exons of GENOME_PRJCA005355（without --is）. blastn_cds_no-is

AlexKnyshov commented 2 years ago

OK, I checked this particular gene by downloading the genome and running alibaseq myself. So one thing I forgot to mention is that when you have such large and well aligned hits, e-value will be very high for all hits. This can throw off the HSP comparison algorithm.

You should try adding -m b-e-i to alibaseq's command and retrying. It then should extract the largest piece (568356 to 604714bp), not the part btw 559493 and 568317. This order forces alibaseq to first use bitscore to score alignments and only if that is tie, try evalue, then identity.

The default comparison metric is e/b-i, when it compares difference in evalue and bit and if they are contradictory, it checks identity. This works well for small baits and hybrid capture style contigs, but not so well on full size proteins and large genomic scaffolds, sometimes smaller regions would have higher identity...

However, this only impacts selection which of the two pieces are getting extracted first, it still doesn't merge them.

I found alibaseq refuses to merge this because bait 3410-3569 aligns to scaffold in region 567847-568317, while larger part of the same bait region (2456-3833) aligns to completely different non-overlapping scaffold region 569011-573252. So essentially as I hypothesized before, there's some "repetitive" region with a bit lower identity. Currently alibaseq is forbidden to merge such regions together, because I was afraid it might not do it accurately, and also for phylogenetics loci are typically desired to be devoid of that kind of sequence...

sanvva commented 2 years ago

Yeah! "-m b-e-i --hit-ovlp 100" seems the best option for protein baits when BLAST to chromosome-level assemblies. Thanks you so much! @AlexKnyshov

AlexKnyshov / alibaseq

How to extract all exons of Ortholog Genes (OGs) from chromosome-level assemblies? #6