edgardomortiz / Captus

Assembly of Phylogenomic Datasets from High-Throughput Sequencing data
https://edgardomortiz.github.io/captus.docs/
GNU General Public License v3.0

Captus Extract: missing extractions for some loci in some samples #1

Open LPDagallier opened 1 year ago

LPDagallier commented 1 year ago

Hi,

Thanks a lot for this great tool! It is very efficient and easy to use, and the documentation and outputs are super clear!

I ran into an issue while running Captus extract: the extraction was not carried out for some loci in some samples (see the screenshot of the .html report below). I looked at the Scipio .log for one of these samples and it contains a warning: "Warning: query length mismatch. This will produce unpredictable results!" (see the full warning at the end of this post and the attached NUC_scipio_initial_run1.log).

Interestingly, I re-ran the same Captus extract on the same dataset with exactly the same parameters (but on a different node of the cluster I'm working on; the node is assigned automatically by the cluster manager), and the problem did not occur: it seems to have worked perfectly the second time. The Captus extract .log files are the same between the two runs, but the Scipio initial .log files for the aforementioned sample differ: the one from run1 has the warning message (see the attached Scipio logs).

This seems to be a Scipio issue more than a Captus issue, but do you have any idea what is going on? I also wonder whether there is a way to report the warnings and/or errors from the sub-programs (like Scipio) in the Captus .log itself. Here, I noticed the problem because it was easy to spot in the .html report, but a more subtle sub-program issue could easily be missed if it is not reported directly in the Captus .log. Moreover, when analyses are run on hundreds of samples, it is cumbersome to check the sub-program logs of each sample.
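Until something like that exists, a short script can do a first pass over the sub-program logs. The following is a minimal sketch and not part of Captus: the `*scipio*.log` file pattern and the function names are assumptions, and the message patterns are simply the ones that appear in the warning quoted at the end of this post.

```python
import re
from pathlib import Path

# Message fragments taken from the Scipio warning quoted in this post.
# The list is illustrative, not exhaustive; extend it as needed.
PATTERNS = [
    "Warning:",
    "substr outside of string",
    "Use of uninitialized value",
    "diff_str returned undef",
    "Incorrect calculation of unmatched",
]
PROBLEM_RE = re.compile("|".join(re.escape(p) for p in PATTERNS))


def scan_log_text(text):
    """Return (line_number, line) pairs for lines that match a known pattern."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if PROBLEM_RE.search(line)
    ]


def scan_logs(root, glob="*scipio*.log"):
    """Map each matching log file under `root` to its flagged lines.

    The default glob is an assumption about how the log files are named.
    """
    flagged = {}
    for path in sorted(Path(root).rglob(glob)):
        hits = scan_log_text(path.read_text(errors="replace"))
        if hits:
            flagged[path] = hits
    return flagged


def report(root):
    """Print a short per-file summary of flagged lines."""
    for path, hits in scan_logs(root).items():
        print(f"{path}: {len(hits)} flagged line(s)")
        for number, line in hits[:5]:
            print(f"  line {number}: {line[:120]}")
```

Calling `report()` on the extraction output directory would list only the samples whose logs contain one of these messages, so clean samples do not need to be opened one by one.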

Thanks a lot again for this wonderful tool!

Full warning error:

"Processing BLAT hits: ...Warning: query length mismatch. This will produce unpredictable results!
substr outside of string at /apps/captus/0.9.88/lib/python3.11/site-packages/dependencies/scipio-1.4/scipio.1.4.1.pl line 999, line 566.
Use of uninitialized value $aa in string eq at /apps/captus/0.9.88/lib/python3.11/site-packages/dependencies/scipio-1.4/scipio.1.4.1.pl line 1001, line 566.
substr outside of string at /apps/captus/0.9.88/lib/python3.11/site-packages/dependencies/scipio-1.4/scipio.1.4.1.pl line 1052, line 566.
Use of uninitialized value $aa in string eq at /apps/captus/0.9.88/lib/python3.11/site-packages/dependencies/scipio-1.4/scipio.1.4.1.pl line 1054, line 566.
diff_str returns undef: input is s1: EQQQGGAADEAEPFMGSGRF s2: PRIIDTGFFSKIPPELYHHILKFLS count: 20 from1 :0 from2: 211 query:HLJG-5123 NODE_1521_length_466_cov_5.0000_k_175_flag_1:gaacagcaacaaggcggtgcagcggatgaggccgaaccgttcatgggatccggtcgattt
diff_str returned undef!
Use of uninitialized value $diff_str in concatenation (.) or string at /apps/captus/0.9.88/lib/python3.11/site-packages/dependencies/scipio-1.4/scipio.1.4.1.pl line 2039, line 566.
query:HLJG-5123 target:NODE_1521_length_466_cov_5.0000_k_175_flag_1 Incorrect calculation of unmatched aa's in line 566!

No query sequence 'HLJG-5123_[253]' of length 253 found. Skipped.
No query sequence 'HWUP-5123' of length 256 found.
[...]"

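NUC_scipio_initial_run1.log: https://github.com/edgardomortiz/Captus/files/10807626/NUC_scipio_initial_run1.log
NUC_scipio_initial_run2.log: https://github.com/edgardomortiz/Captus/files/10807627/NUC_scipio_initial_run2.log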

Captus extract report for run1: (note the missing extraction for at least 7 samples) captus-assembly_extract report_run1

edgardomortiz commented 1 year ago

Hi Leo-Paul,

I am very happy that you find Captus useful. My guess is that you are running into RAM limitations: the clue is that the first loci are extracted but extraction fails at some point (a really weird and interesting pattern in your heatmap). Which Captus version are you using? Could you update to 0.9.91? I recently parallelized Scipio, which makes it not only faster but also less RAM-hungry.

If for some reason you can't update Captus, reduce the concurrency of the extract step (to assign more RAM to each extraction).

Scipio is in general very sensitive, but it can sometimes be finicky and its error messages are odd, so I opted for just reporting fail/pass. However, since v0.9.91 of Captus, Scipio has been behaving nicely.

I suspected some sort of error in your reference, but that can't be the case, or you wouldn't get the locus for any sample.

Let me know if this works, please, and thanks again for giving Captus a chance!

Edgardo

LPDagallier commented 1 year ago

Hi Edgardo,

Thanks for your answer. I'm using v0.9.88; I will ask for an update to the latest version on the shared cluster and see how it goes. The difficulty is that this issue is hard to reproduce, as it worked perfectly on the second run.

A RAM limitation could be the explanation: it was running with 32 GB on 8 threads (--threads 8 --ram 32, with nothing specified via --concurrent).
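For a rough sense of the numbers, the sketch below divides the 32 GB across different levels of concurrency. It assumes the RAM is shared evenly among simultaneous extractions, which is a simplification and not a statement about how Captus actually allocates memory; `ram_per_extraction` is a hypothetical helper.

```python
def ram_per_extraction(total_ram_gb, concurrent):
    """Approximate RAM (in GB) available to each simultaneous extraction.

    Assumes an even split of the total, which is only an approximation.
    """
    if concurrent < 1:
        raise ValueError("concurrent must be at least 1")
    return total_ram_gb / concurrent


# Total RAM from the run described above (--ram 32); the concurrency
# values are illustrative, since --concurrent was not set in that run.
for concurrent in (8, 4, 2, 1):
    share = ram_per_extraction(32, concurrent)
    print(f"concurrent={concurrent}: about {share:.1f} GB per extraction")
```

Under that assumption, lowering the number of simultaneous extractions is the only way to give each one more memory without raising the total.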

I will keep you updated, Léo-Paul

edgardomortiz commented 1 year ago

My guess is that it is hard to reproduce because, when it failed, other samples were running at the same time and using up RAM too. That is why I changed Scipio's behavior to be parallel and lightweight.

Good luck!

Edgardo