alekseyzimin / EviAnn_release

This is the standalone version of the EviAnn pipeline
GNU General Public License v3.0

Detection of Pseudogenes failed, error with proteins_all.faa file #8

Open mlammari opened 1 month ago

mlammari commented 1 month ago

Hello,

I want to run EviAnn on some test data I have for Citrus sinensis. I followed the instructions on the README page, but I keep getting the following error and have not been able to resolve it:

[Mon Jul 22 19:05:39 UTC 2024] Aligning proteins
Error with file '.../proteins_all.faa'
[Mon Jul 22 19:05:40 UTC 2024] Filtering protein alignment file
[Mon Jul 22 19:05:42 UTC 2024] Running exonerate on the filtered sequences
[Mon Jul 22 19:05:42 UTC 2024] Detecting and annotating processed pseudogenes
[Mon Jul 22 19:05:42 UTC 2024] Detection of pseudogenes failed

I prepared my RNA-seq data and protein homology data as instructed. I ran the sample command. Could this possibly be an issue with exonerate?

alekseyzimin commented 1 month ago

Hello, ".../proteins_all.faa" does not look like a valid path. Maybe you meant "../proteins_all.faa"?

mlammari commented 1 month ago

Hello,

I was actually able to fix this problem. I have another question, about tblastn and exonerate. Run as is, tblastn took several hours and never ran to completion. I tweaked the tblastn command in the eviprot.sh script by adding -subject_besthit, and with that change it was able to finish. However, once the pipeline reached exonerate, that step also took a considerable amount of time. The genome I'm working with is around 371M, and the protein file is around 22M. Is this a normal amount of time for a genome and protein file of this size, and are there ways to ensure tblastn runs to completion? Thank you!

alekseyzimin commented 1 month ago

eviprot is the longest part of the pipeline. It takes 2-3 days to align about 500Mb of protein sequences to a 2.5Gbp mouse genome on a 24-core Intel Xeon server. The -subject_besthit option will reduce sensitivity a lot. Aligning 22Mb of protein sequence to a 371Mbp genome should be relatively trivial, 2-3 hours at the most. What computer are you using (cores/RAM)?
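
As a rough sanity check on that estimate, the reference figures above can be scaled down to the smaller data set. This is only a back-of-the-envelope sketch: it assumes runtime grows in proportion to (protein size x genome size), which is a simplification and not something EviAnn documents, and it ignores the difference in core count. The function name is made up for illustration.

```python
def scaled_runtime_hours(ref_hours, ref_protein_mb, ref_genome_mbp,
                         protein_mb, genome_mbp):
    """Scale a reference runtime by the ratio of (protein size x genome size).

    Assumption: runtime is proportional to the product of the two sizes.
    """
    ratio = (protein_mb / ref_protein_mb) * (genome_mbp / ref_genome_mbp)
    return ref_hours * ratio


# Reference point from the thread: ~500 Mb of protein against a 2.5 Gbp genome
# takes 2-3 days (48-72 hours) on a 24-core server.
low = scaled_runtime_hours(48, 500, 2500, 22, 371)
high = scaled_runtime_hours(72, 500, 2500, 22, 371)
print(f"Estimated runtime: {low:.2f} to {high:.2f} hours")
```

Under this naive scaling the job comes out well under an hour, so the "2-3 hours at the most" figure is a generous upper bound, and a run that does not finish after many hours points to something other than data size.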

mlammari commented 1 month ago

I am running this on a 32-core server with 246 GB of RAM in total. For more detail: when running tblastn without -subject_besthit, all the .tmp files were converted to .out files except for one batch. That one batch is what kept tblastn running indefinitely.
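
To pin down which batch is stuck, one option is to list the .tmp files that have no finished .out counterpart. The sketch below is hypothetical: it assumes each batch is written as `<name>.tmp` and ends up as `<name>.out` in the same directory, which matches the description above but is not confirmed against eviprot.sh. The function name and the directory argument are placeholders.

```python
from pathlib import Path


def unfinished_batches(workdir):
    """Return .tmp files in workdir that have no .out file with the same stem.

    Assumption: a finished batch '<name>.tmp' appears as '<name>.out'.
    """
    workdir = Path(workdir)
    finished = {p.stem for p in workdir.glob("*.out")}
    return sorted(p for p in workdir.glob("*.tmp") if p.stem not in finished)


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory holding the batch files.
    for path in unfinished_batches("."):
        print(f"{path.name}\t{path.stat().st_size} bytes")
```

Running it a few minutes apart and comparing the reported sizes shows whether the remaining batch is still growing or has stalled.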

alekseyzimin commented 1 month ago

Thank you, I will re-check running with this option (-subject_besthit). Maybe I am confusing it with something else, because according to its description it should not be harmful to the result.

mlammari commented 1 month ago

Thank you for checking in on this issue. I appreciate the help!


alekseyzimin commented 1 month ago

I confirm that using the -subject_besthit option in tblastn does not affect the results. I will include it in the next release.

alekseyzimin commented 1 month ago

Please check out the new release I just posted.