Closed clami66 closed 10 months ago
Also, is there always only one additionalNodeEntries
per sample/taxID? If so, why the wildcard *
in the command above?
@LeandroRitter do you have any comments on this? I'm not sure I know what to expect in terms of output from MaltExtract.
On the step where #56 occurs we want to figure out what reference sequence to use for computing breadth of coverage. What we are doing in this line of code, we extract the sequence ID (not taxID) that has the most reads aligned. This is because each taxID has multiple reference sequences (sequence IDs) matching it in the NCBI annotation. The question is: what sequence to use to for computing breadth of coverage? MaltExtract ranks all the ref sequences by the % of reads aligned to each of them. This can a way to select "the best" reference sequence. Now, I know that this does not always work (not for all taxIDs), although it works for the vast majority of taxIDs. I never had time to dig into this problem but my understanding was that it has to do with inconsistencies in the NCBI annotation used for Malt step and the ncbi.tre and ncbi.map files that MaltExtract uses, i.e. the we download from Megan's web-page. My solution to this problem was to add "|| true" to the authentication step that should suppress the errors what the ncbi.tre and ncbi.map do not agree with the tre- and map-files used for Malt / MaltExtract. I know that this worked for the vast majority of taxIDs.
So I have looked a bit into the specific issue with taxid 374840
. Looking at the krakenuniq.output we have:
0.006321 1149 343 4561 2.59 0.246 374840 species Enterobacteria phage phiX174 sensu lato
0.0007316 133 83 444 3.36 0.3993 10844 no rank Enterobacteria phage S13
0.0001485 27 27 75 2.89 0.3 1004952567 sequence AF274751.1 Bacteriophage S13, complete genome
0.0001265 23 23 76 3.96 0.2969 1004952568 sequence M14428.1 Bacteriophage S13 circular DNA, complete genome
0.0006436 117 0 256 3.64 0.1775 338111 no rank Enterobacteria phage ID1
0.0006436 117 117 256 3.64 0.1775 1006696428 sequence DQ079880.1 Coliphage ID1, complete genome
...
So it looks like, while phiX174's children nodes (no rank
taxes) do have a sequence associated to them, the species-level taxid does not have a sequence/assembly/genome! I don't know how often this happens, but my guess is that @LeandroRitter is right and it's an edge case. I realize this explanation is kinda besides the point, but maybe you were also curious :)
@percyfal I was wondering now that you have solved some other issues if your suggestion from slack:
we should probably check that the taxid is present in malt database before processing it further.
would be enough to avoid this issue?
Thanks for the update @clami66. In terms of not having a reference attached to phiX174, then yes, it's an edge case, but the question is where to handle it. I think eliminating it from the malt database would work, only this must be reported somehow in the final output.
I still don't get why so many other jobs, notably Post_Process and Breadth_Of_Coverage, fail though.
@percyfal are you referring to the test on rackham under /proj/nobackup/metagenomics/perun/Early_urbinasation_test3
?
One potential issue I see is when looking at the breadth of coverage logs:
ls -ltr logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_*
a few of the filenames look odd to me, for example:
-rw-rw----+ 1 perun ps30331 0 Apr 4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1791_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr 4 23:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_217204_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr 7 12:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_40324_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr 8 10:36 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1830_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr 8 10:38 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_37329_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr 8 10:39 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_374840_374840.log
-rw-rw----+ 1 perun ps30331 0 Apr 8 10:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_305_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr 8 10:58 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_53358_160791.log
Where the filename should have format {sample}_{taxid}_{refid}.log
, while these file have {sample}_{taxid1}_{taxid2}.log
format instead. Digging further, I noticed that for taxid=160791
the additional node entries has only one line:
cat results/AUTHENTICATION/stg018-b1e1l1/160791/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt
TargetNode 01 02 03 04 05 06 07 08 09 10
which is an issue of a different type from the one described in the initial post.
Another taxid for which there is no refid is taxid=486698
b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ cat results/AUTHENTICATION/stg018-b1e1l1/486698/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt
TargetNode 01 02 03 04 05 06 07 08 09 10
Mycobacterium_riyadhense NA NA NA NA NA NA NA NA NA NA
but then when looking for the logs associated to 486698
in logs/BREADTH_OF_COVERAGE
we get:
b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ ls -ltr logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698*
-rw-rw----+ 1 perun ps30331 0 Apr 4 21:29 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_CP023641.1.log
-rw-rw----+ 1 perun ps30331 0 Apr 4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AP022564.1.log
-rw-rw----+ 1 perun ps30331 0 Apr 4 23:44 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AC004616.1.log
-rw-rw----+ 1 perun ps30331 0 Apr 8 11:05 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_FJ786256.1.log
Which is strange, becase not only there should be no reference associated to that taxid, but the log files cross the taxid with all other refids found in the sample.
I think this points to taxids and refids being paired in an all-against-all fashion inside the snakemake rule. Is this something you expected from this version of the pipeline?
Also strange that for some taxids, where the MaltExtract outputs are non-standard, these are also "used" as refids somehow (e.g. stg018-b1e1l1_374840_374840.log
, stg018-b1e1l1_40324_160791.log
@clami66 yes I'm referring to those tests. I had previously noted that 160791 doesn't have a ref id but that it seemed to pop up in places it shouldn't. Your explanation makes perfect sense, and thanks a lot for looking into this. snakemake expand uses itertools.product by default, so this should probably be swapped for zip somewhere. So to answer your question, no, there should not be all-vs-all taxid-refid pairings.
Hello !
If I may join this conversation, several viruses seem to be sometimes only classified as no rank for example Human parvovirus B19 and in the above case, but also for Influenza for example. It is problematic because a lot of viruses are then probably ignored by the pipeline. I think the best way around would be to include the "no rank" in the pipeline as well. So we would keep "species" and "no rank" all along. Maybe I should create an issue about that ?
Best, Zoé
Le sam. 9 avr. 2022 à 11:45, Claudio Mirabello @.***> a écrit :
@percyfal https://github.com/percyfal are you referring to the test on rackham under /proj/nobackup/metagenomics/perun/Early_urbinasation_test3?
One potential issue I see is when looking at the breadth of coverage logs:
ls -ltr logs/BREADTH_OFCOVERAGE/stg018-b1e1l1*
a few of the filenames look odd to me, for example:
-rw-rw----+ 1 perun ps30331 0 Apr 4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1791_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 4 23:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_217204_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 7 12:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_40324_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:36 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1830_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:38 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_37329_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:39 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_374840_374840.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_305_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:58 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_53358_160791.log
Where the filename should have format {sample}{taxid}{refid}.log, while these file have {sample}{taxid1}{taxid2}.log format instead. Digging further, I noticed that for taxid=160791 the additional node entries has only one line:
cat results/AUTHENTICATION/stg018-b1e1l1/160791/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt TargetNode 01 02 03 04 05 06 07 08 09 10
which is an issue of a different type from the one described in the initial post.
Another taxid for which there is no refid is taxid=486698
b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ cat results/AUTHENTICATION/stg018-b1e1l1/486698/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt TargetNode 01 02 03 04 05 06 07 08 09 10 Mycobacterium_riyadhense NA NA NA NA NA NA NA NA NA NA
but then when looking for the logs associated to 486698 in logs/BREADTH_OF_COVERAGE we get:
b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ ls -ltr logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698* -rw-rw----+ 1 perun ps30331 0 Apr 4 21:29 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_CP023641.1.log -rw-rw----+ 1 perun ps30331 0 Apr 4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AP022564.1.log -rw-rw----+ 1 perun ps30331 0 Apr 4 23:44 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AC004616.1.log -rw-rw----+ 1 perun ps30331 0 Apr 8 11:05 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_FJ786256.1.log
Which is strange, becase not only there should be no reference associated to that taxid, but the log files cross the taxid with all other refids found in the sample.
I think this points to taxids and refids being paired in an all-against-all fashion inside the snakemake rule. Is this something you expected from this version of the pipeline?
Also strange that for some taxids, where the MaltExtract outputs are non-standard, these are also "used" as refids somehow (e.g. stg018-b1e1l1_374840_374840.log, stg018-b1e1l1_40324_160791.log
— Reply to this email directly, view it on GitHub https://github.com/NBISweden/ancient-microbiome-smk/issues/56#issuecomment-1093839574, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTPL5HKGC7AJJJXD6Y2Z73VEFGTNANCNFSM5O6NTLSQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Hi Zoe, thanks for pointing this out
I think the best way around would be to include the "no rank" in the pipeline as well. ... Maybe I should create an issue about that ?
I agree that it would be good to open a separate issue, so that we can discuss separately on finding possible better strategies to parse the KrakenUniq outputs.
@clami66 yes I'm referring to those tests. I had previously noted that 160791 doesn't have a ref id but that it seemed to pop up in places it shouldn't. Your explanation makes perfect sense, and thanks a lot for looking into this. snakemake expand uses itertools.product by default, so this should probably be swapped for zip somewhere. So to answer your question, no, there should not be all-vs-all taxid-refid pairings.
Yep, here is the culprit: https://github.com/NBISweden/ancient-microbiome-smk/blob/authentic_snake_amendment/workflow/rules/common.smk#L198
And here is the culprit for those taxids popping up in place of refids, thouht I might have been the one to blame: https://github.com/NBISweden/ancient-microbiome-smk/blob/973566f6b1d253522c45f16d500a05c823310e2b/workflow/rules/common.smk#L219
I think that is a feature though? In cases we couldn't identify the ref_id, return the taxid.
Yes it's done on purpose: https://github.com/NBISweden/ancient-microbiome-smk/blob/973566f6b1d253522c45f16d500a05c823310e2b/workflow/rules/common.smk#L230
I just had forgotten about it when I posted earlier
I close it since the authentication seems to have been working file for a few months
I am looking into the authentication scripts and there is one step I am wondering about:
head -2 $OUT_DIR/${RMA6}_MaltExtract_output/default/readDist/*.rma6_additionalNodeEntries.txt | tail -1 | cut -d ';' -f2 | sed 's/'_'/''/1' > $OUT_DIR/name.list
If I understand correctly this should concatenate all
additionalNodeEntries.txt
files and pick from the last one the name of the reference sequence. This however fails at times, for example on Rackham there is this test:ancient_microbiome_workflow/test_results/results_heavy_Pict_data/AUTHENTICATION/bal003
where for taxID 374840 the
additionalNodeEntries
file looks like this:where the reference ID is unavailable, then the
head -2 ...
operation gives unpredictable results and some authentication outputs are missing.In order to address this issue, I am wondering whether it is really necessary to extract the reference sequence ID in this case? Is the
additionalNodeEntries
file always containing only two lines? In that case it might be enough to use the taxID alone.A follow-up question on this is actually about file naming, which is a bit convoluted I think. For example in the previous case we have a folder called:
AUTHENTICATION/bal003/374840/bal003.trimmed.rma6_MaltExtract_output/default/readDist/bal003.trimmed.rma6_additionalNodeEntries.txt
Where the name of the sample
bal003
is repeated three times in the filepath. Would it not be enough to have the sampleID name only in the first subfolder, like so:AUTHENTICATION/bal003/374840/MaltExtract_output/default/readDist/additionalNodeEntries.txt