edgardomortiz / Captus

Assembly of Phylogenomic Datasets from High-Throughput Sequencing data
https://edgardomortiz.github.io/captus.docs/
GNU General Public License v3.0

MAFFT adjust direction #5

Open EdBiffin opened 9 months ago

EdBiffin commented 9 months ago

I've noticed that MAFFT is generating alignments with sequences in both forward and reverse orientation. Is it possible to add the MAFFT --adjustdirection flag to the pipeline?

edgardomortiz commented 9 months ago

Hi @EdBiffin!

I didn't use the --adjustdirection flag because, during extraction, all the sequences are put in the same direction as the sequence you used as reference. Could you upload one of those alignments here? I would like to solve the issue (or at least explain it).

Thanks

Edgardo

EdBiffin commented 9 months ago

Hi Edgardo, thanks for your quick response. I've attached an example alignment. I'm using a custom reference file, which I've also attached. Look forward to your response. Ed

Attachments: captus_refs_nu_combined.fasta.txt, 6164.fna.txt

edgardomortiz commented 9 months ago

Thank you Ed,

Could you also tell me the actual command you used or, even better, upload the extraction .log file? This is very strange; the sequences shouldn't be reversed...

Edgardo

EdBiffin commented 9 months ago

Please find the log attached, and let me know if you need anything else.

Attachment: captus-assembly_extract.log

edgardomortiz commented 9 months ago

Thanks for the patience!

Would it be possible to upload the assembly.fasta for 376903_Malleostemon_tuberculatus and 376896_Austrobaeckea_verrucosa? (If they are too big, maybe other smaller assemblies that produce locus 6164 in opposite directions.) Finally, so I can try to replicate the issue, what was the captus align command you used?

EdBiffin commented 9 months ago

376903_Malleostemon_tuberculatus__captus-asm copy.zip 376896_Austrobaeckea_verrucosa__captus-asm copy.zip

edgardomortiz commented 9 months ago

Sorry, the link for Malleostemon got broken... (I got the other two)

EdBiffin commented 9 months ago

376903_Malleostemon_tuberculatus__captus-asm copy.zip

edgardomortiz commented 9 months ago

By the way, while checking the reference I noticed you have several sequences with identical names. Captus will only take one of them, because names have to be unique to avoid problems. (In the attached picture the duplicates have a 2 after the name; these are just an example, there are many more.)
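
A quick way to list repeated names in a reference before running the extraction is sketched below. This is only an illustration, not part of Captus: the function name `duplicated_fasta_names` is hypothetical, and it assumes the sequence name is the first whitespace-delimited token of each header line, which may differ from how Captus parses names.

```python
from collections import Counter


def duplicated_fasta_names(fasta_text):
    """Return {name: count} for sequence names that occur more than once.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip headers that are just '>'
                names.append(tokens[0])
    return {name: n for name, n in Counter(names).items() if n > 1}


# Toy reference with one repeated name (sequences are made up)
example = """>Syzygium_micranthum-6164
ATGGCC
>eucgr-6164
ATGGCA
>Syzygium_micranthum-6164
ATGGCG
"""
print(duplicated_fasta_names(example))
```

For a real file, the same function can be called on the result of `open(path).read()`.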

edgardomortiz commented 9 months ago

I got it! When you provide a reference of nuclear proteins in nucleotides (CDS), Captus needs to translate it first (because Scipio performs a translated search on the assemblies).

Because I can't assume all sequences are translatable in Frame 1, Captus tries to guess the reading frame for each sequence: it translates it in the six reading frames and selects the frame that produces the fewest stop codons.

Now, I didn't anticipate that in some references, like yours, a sequence such as Syzygium_micranthum-6164 can be perfectly translated in both Frame 1 and Reverse Frame 3 (Captus chose the latter in this case), so I will modify the code to choose a positive (forward) reading frame in tied cases like this. So basically, the reversed sequences in alignment 6164 followed this "reversed" protein from Syzygium_micranthum-6164.
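
The frame guessing and tie-breaking described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in Captus's bioformats.py: the function names are hypothetical, the standard genetic code is assumed, and stop codons are counted directly instead of producing a full translation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_stops(seq, offset):
    """Count stop codons among the complete codons starting at `offset`."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return sum(codon in STOP_CODONS for codon in codons)


def guess_frame(seq):
    """Return (frame, stops) for the frame with the fewest stop codons.

    Frames 1..3 are forward, -1..-3 are on the reverse complement.
    Forward frames are listed first and min() keeps the first minimum,
    so a tie between a forward and a reverse frame goes to the forward one.
    """
    rc = reverse_complement(seq)
    candidates = [(offset + 1, count_stops(seq, offset)) for offset in range(3)]
    candidates += [(-(offset + 1), count_stops(rc, offset)) for offset in range(3)]
    return min(candidates, key=lambda item: item[1])


# Toy sequence with no stop codons in Frame 1 or in Reverse Frame 1
print(guess_frame("ATGGCCGCCGGC"))
```

With the tie-break in place, the toy sequence is assigned Frame 1 even though a reverse frame is equally free of stop codons, which is the behaviour wanted for cases like Syzygium_micranthum-6164.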

Until I post the updated code, the solution, unfortunately, would be to provide the reference in amino acids (or to remove Syzygium_micranthum-6164 and provide the rest in nucleotides). Have you noticed other cases with reversed sequences?

Edgardo

edgardomortiz commented 9 months ago

Actually, in the same locus, eucgr-6164 can also be translated in Reverse Frame 1 without stop codons, whereas Frame 1 has a final stop codon. I guess I will also need to add a rule to not count a stop codon when it is at the end.
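
The extra rule could look something like the sketch below, where a stop codon is ignored if it is the last complete codon of the frame. Again, this is only an illustration with hypothetical names and the standard genetic code assumed, not the code that went into the release.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def codons_in_frame(seq, offset=0):
    """Split `seq` into complete codons starting at `offset`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def count_internal_stops(seq, offset=0):
    """Count stop codons, ignoring one that terminates the frame."""
    codons = codons_in_frame(seq, offset)
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # a terminal stop is expected in a complete CDS
    return sum(codon in STOP_CODONS for codon in codons)


# A CDS ending in TAA is no longer penalized in Frame 1
print(count_internal_stops("ATGGCCGCCTAA"))
# A stop in the middle of the frame still counts
print(count_internal_stops("ATGTAAGCCTAA"))
```

Counting this way keeps a complete CDS with its terminal stop from losing to a reverse frame that happens to have no stop codons at all.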

edgardomortiz commented 9 months ago

Hi again,

This fix will come with the next release (v1.0.1). For now, just decompress this attachment and replace your current bioformats.py (in the captus folder that is inside your Captus installation folder) with this version, which improves the reading frame prediction. In my tests, locus 6164 is now correctly translated in the reference.

Attachment: bioformats.py.zip

EdBiffin commented 9 months ago

Hi Edgardo, that all makes sense. Thanks again for your help; I look forward to the next release.

edgardomortiz commented 7 months ago

Dear Ed,

In case you didn't patch the previous version, I made the release on Bioconda incorporating many other changes... Let me know if v1.0.1 works better in this aspect.

Edgardo

EdBiffin commented 7 months ago

Dear Edgardo, thanks for the heads up. I tried the patch and I've also run some data through using v1.0.1. All looks good, but I'll let you know if I find any issues. Many thanks for your help. Ed

