NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License
19 stars 14 forks source link

Authentication might fail depending on MaltExtract output #56

Closed clami66 closed 10 months ago

clami66 commented 2 years ago

I am looking into the authentication scripts and there is one step I am wondering about:

head -2 $OUT_DIR/${RMA6}_MaltExtract_output/default/readDist/*.rma6_additionalNodeEntries.txt | tail -1 | cut -d ';' -f2 | sed 's/'_'/''/1' > $OUT_DIR/name.list

If I understand correctly this should concatenate all additionalNodeEntries.txt files and pick from the last one the name of the reference sequence. This however fails at times, for example on Rackham there is this test:

ancient_microbiome_workflow/test_results/results_heavy_Pict_data/AUTHENTICATION/bal003

where for taxID 374840 the additionalNodeEntries file looks like this:

TargetNode  01  02  03  04  05  06  07  08  09  10
Enterobacteria_phage_phiX174_sensu_lato NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

where the reference ID is unavailable, then the head -2 ... operation gives unpredictable results and some authentication outputs are missing.

In order to address this issue, I am wondering whether it is really necessary to extract the reference sequence ID in this case? Is the additionalNodeEntries file always containing only two lines? In that case it might be enough to use the taxID alone.

A follow-up question on this is actually about file naming, which is a bit convoluted I think. For example in the previous case we have a folder called:

AUTHENTICATION/bal003/374840/bal003.trimmed.rma6_MaltExtract_output/default/readDist/bal003.trimmed.rma6_additionalNodeEntries.txt

Where the name of the sample bal003 is repeated three times in the filepath. Would it not be enough to have the sampleID name only in the first subfolder, like so:

AUTHENTICATION/bal003/374840/MaltExtract_output/default/readDist/additionalNodeEntries.txt

clami66 commented 2 years ago

Also, is there always only one additionalNodeEntries per sample/taxID? If so, why the wildcard * in the command above?

percyfal commented 2 years ago

@LeandroRitter do you have any comments on this? I'm not sure I know what to expect in terms of output from MaltExtract.

LeandroRitter commented 2 years ago

On the step where #56 occurs we want to figure out what reference sequence to use for computing breadth of coverage. What we are doing in this line of code, we extract the sequence ID (not taxID) that has the most reads aligned. This is because each taxID has multiple reference sequences (sequence IDs) matching it in the NCBI annotation. The question is: what sequence to use to for computing breadth of coverage? MaltExtract ranks all the ref sequences by the % of reads aligned to each of them. This can a way to select "the best" reference sequence. Now, I know that this does not always work (not for all taxIDs), although it works for the vast majority of taxIDs. I never had time to dig into this problem but my understanding was that it has to do with inconsistencies in the NCBI annotation used for Malt step and the ncbi.tre and ncbi.map files that MaltExtract uses, i.e. the we download from Megan's web-page. My solution to this problem was to add "|| true" to the authentication step that should suppress the errors what the ncbi.tre and ncbi.map do not agree with the tre- and map-files used for Malt / MaltExtract. I know that this worked for the vast majority of taxIDs.

clami66 commented 2 years ago

So I have looked a bit into the specific issue with taxid 374840. Looking at the krakenuniq.output we have:

0.006321        1149    343     4561    2.59    0.246   374840  species                       Enterobacteria phage phiX174 sensu lato
0.0007316       133     83      444     3.36    0.3993  10844   no rank                         Enterobacteria phage S13
0.0001485       27      27      75      2.89    0.3     1004952567      sequence                                  AF274751.1 Bacteriophage S13, complete genome
0.0001265       23      23      76      3.96    0.2969  1004952568      sequence                                  M14428.1 Bacteriophage S13 circular DNA, complete genome
0.0006436       117     0       256     3.64    0.1775  338111  no rank                         Enterobacteria phage ID1
0.0006436       117     117     256     3.64    0.1775  1006696428      sequence                                  DQ079880.1 Coliphage ID1, complete genome
...

So it looks like, while phiX174's children nodes (no rank taxes) do have a sequence associated to them, the species-level taxid does not have a sequence/assembly/genome! I don't know how often this happens, but my guess is that @LeandroRitter is right and it's an edge case. I realize this explanation is kinda besides the point, but maybe you were also curious :)

@percyfal I was wondering now that you have solved some other issues if your suggestion from slack:

we should probably check that the taxid is present in malt database before processing it further.

would be enough to avoid this issue?

percyfal commented 2 years ago

Thanks for the update @clami66. In terms of not having a reference attached to phiX174, then yes, it's an edge case, but the question is where to handle it. I think eliminating it from the malt database would work, only this must be reported somehow in the final output.

I still don't get why so many other jobs, notably Post_Process and Breadth_Of_Coverage, fail though.

clami66 commented 2 years ago

@percyfal are you referring to the test on rackham under /proj/nobackup/metagenomics/perun/Early_urbinasation_test3?

One potential issue I see is when looking at the breadth of coverage logs:

ls -ltr logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_*

a few of the filenames look odd to me, for example:

-rw-rw----+ 1 perun ps30331 0 Apr  4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1791_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr  4 23:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_217204_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr  7 12:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_40324_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr  8 10:36 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1830_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr  8 10:38 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_37329_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr  8 10:39 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_374840_374840.log
-rw-rw----+ 1 perun ps30331 0 Apr  8 10:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_305_160791.log
-rw-rw----+ 1 perun ps30331 0 Apr  8 10:58 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_53358_160791.log

Where the filename should have format {sample}_{taxid}_{refid}.log, while these file have {sample}_{taxid1}_{taxid2}.log format instead. Digging further, I noticed that for taxid=160791 the additional node entries has only one line:

cat  results/AUTHENTICATION/stg018-b1e1l1/160791/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt
TargetNode  01  02  03  04  05  06  07  08  09  10

which is an issue of a different type from the one described in the initial post.

Another taxid for which there is no refid is taxid=486698

b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ cat results/AUTHENTICATION/stg018-b1e1l1/486698/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt
TargetNode  01  02  03  04  05  06  07  08  09  10
Mycobacterium_riyadhense    NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

but then when looking for the logs associated to 486698 in logs/BREADTH_OF_COVERAGE we get:

b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ ls -ltr logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698*
-rw-rw----+ 1 perun ps30331 0 Apr  4 21:29 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_CP023641.1.log
-rw-rw----+ 1 perun ps30331 0 Apr  4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AP022564.1.log
-rw-rw----+ 1 perun ps30331 0 Apr  4 23:44 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AC004616.1.log
-rw-rw----+ 1 perun ps30331 0 Apr  8 11:05 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_FJ786256.1.log

Which is strange, becase not only there should be no reference associated to that taxid, but the log files cross the taxid with all other refids found in the sample.

I think this points to taxids and refids being paired in an all-against-all fashion inside the snakemake rule. Is this something you expected from this version of the pipeline?

Also strange that for some taxids, where the MaltExtract outputs are non-standard, these are also "used" as refids somehow (e.g. stg018-b1e1l1_374840_374840.log, stg018-b1e1l1_40324_160791.log

percyfal commented 2 years ago

@clami66 yes I'm referring to those tests. I had previously noted that 160791 doesn't have a ref id but that it seemed to pop up in places it shouldn't. Your explanation makes perfect sense, and thanks a lot for looking into this. snakemake expand uses itertools.product by default, so this should probably be swapped for zip somewhere. So to answer your question, no, there should not be all-vs-all taxid-refid pairings.

ZoePochon commented 2 years ago

Hello !

If I may join this conversation, several viruses seem to be sometimes only classified as no rank for example Human parvovirus B19 and in the above case, but also for Influenza for example. It is problematic because a lot of viruses are then probably ignored by the pipeline. I think the best way around would be to include the "no rank" in the pipeline as well. So we would keep "species" and "no rank" all along. Maybe I should create an issue about that ?

Best, Zoé

Le sam. 9 avr. 2022 à 11:45, Claudio Mirabello @.***> a écrit :

@percyfal https://github.com/percyfal are you referring to the test on rackham under /proj/nobackup/metagenomics/perun/Early_urbinasation_test3?

One potential issue I see is when looking at the breadth of coverage logs:

ls -ltr logs/BREADTH_OFCOVERAGE/stg018-b1e1l1*

a few of the filenames look odd to me, for example:

-rw-rw----+ 1 perun ps30331 0 Apr 4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1791_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 4 23:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_217204_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 7 12:46 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_40324_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:36 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_1830_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:38 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_37329_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:39 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_374840_374840.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_305_160791.log -rw-rw----+ 1 perun ps30331 0 Apr 8 10:58 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_53358_160791.log

Where the filename should have format {sample}{taxid}{refid}.log, while these file have {sample}{taxid1}{taxid2}.log format instead. Digging further, I noticed that for taxid=160791 the additional node entries has only one line:

cat results/AUTHENTICATION/stg018-b1e1l1/160791/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt TargetNode 01 02 03 04 05 06 07 08 09 10

which is an issue of a different type from the one described in the initial post.

Another taxid for which there is no refid is taxid=486698

b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ cat results/AUTHENTICATION/stg018-b1e1l1/486698/stg018-b1e1l1.trimmed.rma6_MaltExtract_output/ancient/readDist/stg018-b1e1l1.trimmed.rma6_additionalNodeEntries.txt TargetNode 01 02 03 04 05 06 07 08 09 10 Mycobacterium_riyadhense NA NA NA NA NA NA NA NA NA NA

but then when looking for the logs associated to 486698 in logs/BREADTH_OF_COVERAGE we get:

b-an01 [/proj/nobackup/metagenomics/perun/Early_urbinasation_test3]$ ls -ltr logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698* -rw-rw----+ 1 perun ps30331 0 Apr 4 21:29 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_CP023641.1.log -rw-rw----+ 1 perun ps30331 0 Apr 4 23:41 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AP022564.1.log -rw-rw----+ 1 perun ps30331 0 Apr 4 23:44 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_AC004616.1.log -rw-rw----+ 1 perun ps30331 0 Apr 8 11:05 logs/BREADTH_OF_COVERAGE/stg018-b1e1l1_486698_FJ786256.1.log

Which is strange, becase not only there should be no reference associated to that taxid, but the log files cross the taxid with all other refids found in the sample.

I think this points to taxids and refids being paired in an all-against-all fashion inside the snakemake rule. Is this something you expected from this version of the pipeline?

Also strange that for some taxids, where the MaltExtract outputs are non-standard, these are also "used" as refids somehow (e.g. stg018-b1e1l1_374840_374840.log, stg018-b1e1l1_40324_160791.log

— Reply to this email directly, view it on GitHub https://github.com/NBISweden/ancient-microbiome-smk/issues/56#issuecomment-1093839574, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTPL5HKGC7AJJJXD6Y2Z73VEFGTNANCNFSM5O6NTLSQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

clami66 commented 2 years ago

Hi Zoe, thanks for pointing this out

I think the best way around would be to include the "no rank" in the pipeline as well. ... Maybe I should create an issue about that ?

I agree that it would be good to open a separate issue, so that we can discuss separately on finding possible better strategies to parse the KrakenUniq outputs.

percyfal commented 2 years ago

@clami66 yes I'm referring to those tests. I had previously noted that 160791 doesn't have a ref id but that it seemed to pop up in places it shouldn't. Your explanation makes perfect sense, and thanks a lot for looking into this. snakemake expand uses itertools.product by default, so this should probably be swapped for zip somewhere. So to answer your question, no, there should not be all-vs-all taxid-refid pairings.

Yep, here is the culprit: https://github.com/NBISweden/ancient-microbiome-smk/blob/authentic_snake_amendment/workflow/rules/common.smk#L198

clami66 commented 2 years ago

And here is the culprit for those taxids popping up in place of refids, thouht I might have been the one to blame: https://github.com/NBISweden/ancient-microbiome-smk/blob/973566f6b1d253522c45f16d500a05c823310e2b/workflow/rules/common.smk#L219

percyfal commented 2 years ago

I think that is a feature though? In cases we couldn't identify the ref_id, return the taxid.

clami66 commented 2 years ago

Yes it's done on purpose: https://github.com/NBISweden/ancient-microbiome-smk/blob/973566f6b1d253522c45f16d500a05c823310e2b/workflow/rules/common.smk#L230

I just had forgotten about it when I posted earlier

LeandroRitter commented 10 months ago

I close it since the authentication seems to have been working file for a few months