NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License
19 stars 15 forks source link

Error in rule Plot_Authentication_Score when using FULL NCBI database #137

Open npsonis opened 9 months ago

npsonis commented 9 months ago

Hi, I got the following error Error in rule Plot_Authentication_Score: jobid: 15 input: results/AUTHENTICATION/.30012ssL12c.collapsed_done output: results/overview_heatmap_scores.pdf log: logs/PLOT_AUTHENTICATION_SCORE/plot_authenticationscore.log (check log file(s) for error details) conda-env: /aMeta/.snakemake/conda/55e1a8bf92c9925034725aa9b149b81b shell: Rscript /aMeta/workflow/scripts/plot_score.R results/AUTHENTICATION $(dirname results/overview_heatmap_scores.pdf) &> logs/PLOT_AUTHENTICATION_SCORE/plot_authentication_score.log (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

and then KrakenUniq_AbundanceMatrix also failed

Error in rule KrakenUniq_AbundanceMatrix: jobid: 11 input: results/KRAKENUNIQ/30012ssL12c.collapsed/krakenuniq.output.filtered output: results/KRAKENUNIQ_ABUNDANCE_MATRIX, results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt, results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_names_list.txt, results/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_abundance_matrix.txt, results/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_absolute_abundance_heatmap.pdf log: logs/KRAKENUNIQ_ABUNDANCE_MATRIX/KRAKENUNIQ_ABUNDANCEMATRIX.log (check log file(s) for error details) conda-env: /aMeta/.snakemake/conda/55e1a8bf92c9925034725aa9b149b81b shell: Rscript /aMeta/workflow/scripts/krakenuniq_abundance_matrix.R results/KRAKENUNIQ results/KRAKENUNIQ_ABUNDANCE_MATRIX 250 50 &> logs/KRAKENUNIQ_ABUNDANCE_MATRIX/KRAKENUNIQ_ABUNDANCE_MATRIX.log;Rscript /aMeta/workflow/scripts/plot_krakenuniq_abundance_matrix.R results/KRAKENUNIQ_ABUNDANCE_MATRIX results/KRAKENUNIQ_ABUNDANCE_MATRIX &> logs/KRAKENUNIQ_ABUNDANCE_MATRIX/KRAKENUNIQ_ABUNDANCE_MATRIX.log (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

The log files state:

plot_authentication_score.log Error in names(score_per_sample[[i]]) <- *vtmp* : attempt to set an attribute on NULL Execution halted

KRAKENUNIQ_ABUNDANCE_MATRIX.log [1] X. reads taxReads kmers dup cov taxID rank
[9] taxName SAMPLE

<0 rows> (or 0-length row.names) Error in `[<-`(`*tmp*`, i, j, value = 0) : subscript out of bounds Execution halted Any advise?
NikolayOskolkov commented 8 months ago

Hi @npsonis, thank you for reporting this. The issue seems to be on the KrakenUniq step rather than plotting scores rule. Could you please check that the file results/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_abundance_matrix.txt is non-empty, and if not, possibly post first 2-3 lines of the file?

npsonis commented 8 months ago

Hi Nikolay, there is no KRAKENUNIQ_ABUNDANCE_MATRIX directory I am afraid.

ZoePochon commented 8 months ago

I think the problem is due to this back to line in the KrakenUniq output files from npsonis but I don't know why this happens.

On Wed, 18 Oct 2023 at 13:32, npsonis @.***> wrote:

Hi Nikolay, there is no KRAKENUNIQ_ABUNDANCE_MATRIX directory I am afraid.

— Reply to this email directly, view it on GitHub https://github.com/NBISweden/aMeta/issues/137#issuecomment-1768256129, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTPL5GIK52ML3PVBTAXTUTX7643JAVCNFSM6AAAAAA5UMOSTKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONRYGI2TMMJSHE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

npsonis commented 8 months ago

In the krakenuniq.output.filtered the taxName field is empty in all species

% reads taxReads kmers dup cov taxID rank taxName 3.563 914033 415742.0 14176.0 1720.0 0.0001958 63459.0 species 0.6799 174442 43682.0 1470.0 1350.0 3.717e-05 3562.0 species 0.1493 38312 37447.0 2302.0 54.2 0.000223 129921.0 species 0.1219 31277 101.0 337.0 707.0 7.048e-05 639313.0 species 0.08766 22491 163.0 2174.0 45.6 0.0005272 1852.0 species 0.06241 16011 402.0 1341.0 36.9 9.829e-05 1888.0 species 0.05532 14194 9634.0 1381.0 98.1 6.453e-05 37326.0 species 0.05165 13250 55.0 409.0 176.0 2.774e-05 106592.0 species 0.04261 10932 638.0 2402.0 142.0 4.302e-05 4513.0 species 0.02191 5621 4456.0 439.0 44.6 5.835e-05 1764.0 species 0.02139 5488 1307.0 11977.0 3.53 5.514e-05 77133.0 species 0.02013 5165 3790.0 8847.0 6.85 0.0005396 1969.0 species 0.01727 4430 2353.0 3048.0 17.4 6.279e-05 54571.0 species 0.01471 3775 320.0 293.0 22.0 6.135e-05 722731.0 species

and in the krakenuniq.output the back to line issue is happening as Zoe writes above

KrakenUniq v1.0.4 DATE:2023-09-29T03:21:02Z DB:/hits/faIn the krakenuniq.output.filtered the taxName field is empty in all species

% reads taxReads kmers dup cov taxID rank taxName 3.563 914033 415742.0 14176.0 1720.0 0.0001958 63459.0 species 0.6799 174442 43682.0 1470.0 1350.0 3.717e-05 3562.0 species 0.1493 38312 37447.0 2302.0 54.2 0.000223 129921.0 species 0.1219 31277 101.0 337.0 707.0 7.048e-05 639313.0 species 0.08766 22491 163.0 2174.0 45.6 0.0005272 1852.0 species 0.06241 16011 402.0 1341.0 36.9 9.829e-05 1888.0 species 0.05532 14194 9634.0 1381.0 98.1 6.453e-05 37326.0 species 0.05165 13250 55.0 409.0 176.0 2.774e-05 106592.0 species 0.04261 10932 638.0 2402.0 142.0 4.302e-05 4513.0 species 0.02191 5621 4456.0 439.0 44.6 5.835e-05 1764.0 species 0.02139 5488 1307.0 11977.0 3.53 5.514e-05 77133.0 species 0.02013 5165 3790.0 8847.0 6.85 0.0005396 1969.0 species 0.01727 4430 2353.0 3048.0 17.4 6.279e-05 54571.0 species 0.01471 3775 320.0 293.0 22.0 6.135e-05 722731.0 species

and in the krakenuniq.output the back to line issue is happening as Zoe writes above

KrakenUniq v1.0.4 DATE:2023-09-29T03:21:02Z DB:/aMeta_resources/DBDIR_KrakenUniq_Full_NT DBSIZE:1309646889448 WD:/aMeta CL:/aMeta/.snakemake/conda/659177f07dba9178b5f5b35a35bff6a6/bin/krakenuniq --preload-size 100GB --db /aMeta_resources/DBDIR_KrakenUniq_Full_NT --fastq-input results/CUTADAPT_ADAPTER_TRIMMING/25434ssL12c.collapsed.trimmed.fastq.gz --threads 40 --output results/KRAKENUNIQ/25434ssL12c.collapsed/sequences.krakenuniq --report-file results/KRAKENUNIQ/25434ssL12c.collapsed/krakenuniq.output --gzip-compressed --only-classified-out % reads taxReads kmers dup cov taxID rank taxName 82.17 21081547 21081547 183323763 7.74 0.4916 0 no rank unclassified 17.83 4574147 6866 2028404 44.1 1.865e-05 1 no rank root 17.79 4565118 51763 1964405 44.8 1.834e-05 131567 no rank cellular organisms 11.54 2959823 78665 1847633 24.1 7.168e-05 2 superkingdom Bacteria 8.378 2149485 7585 1258215 24 0.0001329 1783272 clade Terrabacteria group 8.016 2056534 6915 1217020 24.1 0.0002478 201174 phylum Actinobacteria /aMeta_resources/DBDIR_KrakenUniq_Full_NT DBSIZE:1309646889448 WD:/aMeta CL:/aMeta/.snakemake/conda/659177f07dba9178b5f5b35a35bff6a6/bin/krakenuniq --preload-size 100GB --db /aMeta_resources/DBDIR_KrakenUniq_Full_NT --fastq-input results/CUTADAPT_ADAPTER_TRIMMING/25434ssL12c.collapsed.trimmed.fastq.gz --threads 40 --output results/KRAKENUNIQ/25434ssL12c.collapsed/sequences.krakenuniq --report-file results/KRAKENUNIQ/25434ssL12c.collapsed/krakenuniq.output --gzip-compressed --only-classified-out % reads taxReads kmers dup cov taxID rank taxName 82.17 21081547 21081547 183323763 7.74 0.4916 0 no rank unclassified 17.83 4574147 6866 2028404 44.1 1.865e-05 1 no rank root 17.79 4565118 51763 1964405 44.8 1.834e-05 131567 no rank cellular organisms 11.54 2959823 78665 1847633 24.1 7.168e-05 2 superkingdom Bacteria 8.378 2149485 7585 1258215 24 0.0001329 1783272 clade Terrabacteria group 8.016 2056534 6915 1217020 24.1 0.0002478 201174 phylum Actinobacteria

NikolayOskolkov commented 8 months ago

@npsonis thanks a lot for the info. This CR symbol inserted to KU report has started recently, not clear why so far. The issue does not seem to be with aMeta itself but with some recent changes in the KU report. Nevertheless, we should be able to fix this within aMeta by modifying the filtering scripts. @clami66 this issue seems to be related to the issue #136, would it be possible to account for this hidden CR symbol when filtering the KU report?

@npsonis for now a quick workaround would be (manually) processing all the krakenuniq.output files with

sed ‘s/\r/\t/g’ ku_output_file > new_ku_output_file

and then restarting aMeta. Could you please try this and let us know if this fix the issue? Also, I would recommend deleting the whole results/AUTHENTICATION directory prior to restarting aMeta.

npsonis commented 8 months ago

So I did as you proposed. krakenuniq.output: % reads taxReads kmers dup cov taxID rank taxName 82.17 21081547 21081547 183323763 7.74 0.4916 0 no rank unclassified 17.83 4574147 6866 2028404 44.1 1.865e-05 1 no rank root 17.79 4565118 51763 1964405 44.8 1.834e-05 131567 no rank cellular organisms 11.54 2959823 78665 1847633 24.1 7.168e-05 2 superkingdom Bacteria 8.378 2149485 7585 1258215 24 0.0001329 1783272 clade Terrabacteria group 8.016 2056534 6915 1217020 24.1 0.0002478 201174 phylum Actinobacteria 7.831 2009145 296030 1184048 24.1 0.0002473 1760 class Actinobacteria

It seems that there are multiple spaces/tabs after rank column which affects the filtering I think because of the following output

krakenuniq.output.filtered % reads taxReads kmers dup cov taxID rank taxName

(no entries)

NikolayOskolkov commented 8 months ago

@npsonis thanks a lot for testing. Would you mind sharing one krakenuniq.output (which was empty after filtering) with me at Nikolay.Oskolkov@biol.lu.se? If you would not like to do it, could you then please

grep -w 'species' krakenuniq.output

and check whether you have any row that has a value >200 in the third column (taxReads) and >1000 in the fourth column (kmers)? My assumption is that, after you cleaned your krakenuniq.output with "sed", the filtering ran correctly, i.e. krakenuniq.output.filtered is a correct output, it is just missing any records because of conservative filters we use by default in aMeta (taxReads = 200 and kmers = 1000). If this is the case, you could try tor reduce e.g. taxReads down to 100 in the config-file. I would not encourage you to reduce the number of kmers though because this is a more critical filter.

npsonis commented 8 months ago

I have used less strict filtering (50 and 250), but there are cases that exceed 200 and 1000. The problem is that with your solution there are two tabs between rank and taxName (hence when filtering there is nothing in the latter column). I think the problem originate due to the fact that I am using Mobaxterm. However I managed to find a inderect solution. a) I opened the file with the text editor of Mobaexterm and in there I saw that the file has DOS format instead of UNIX, I changed back to UNIX and then b) I used the following commands (could be an overkill and an easier way exist) cp krakenuniq.output krakenuniq.output_orig grep '^[0-9]' krakenuniq.output_orig > krakenuniq_numbers.output grep -P '^[\t]' krakenuniq.output_orig | tr -d '\t' > krakenuniq_taxanames.output grep '^[# %]' krakenuniq.output_orig > krakenuniq_header.output paste krakenuniq_numbers.output krakenuniq_taxanames.output > krakenuniq_numbers_taxanames.output cat krakenuniq_header.output krakenuniq_numbers_taxanames.output > krakenuniq.output rm krakenuniq_numbers.output krakenuniq_taxanames.output krakenuniq_header.output krakenuniq_numbers_taxanames.output

Now aMeta worked and finished

NikolayOskolkov commented 8 months ago

Hmm, yes, dos2unix is what comes first the head when the CR (or ^M symbol) is reported, but I asked the user in the issue #136 and he assured me that the CR in that case was not because of does2unix issue. I will double check whether he also used Mobaexterm. If so, we need to make a special point at FAQ about this. Thanks a lot @npsonis for clarifying, glad it worked out eventually!

npsonis commented 8 months ago

Well it could be not related to dos2unix as in another server that I am working (again with mobaexterm) I do not have this issue, albeit in that case I am using the microbial database not the full NCBI.

NikolayOskolkov commented 8 months ago

@npsonis could you please let me know how you saw in Mobaexterm that the file had a dos file-format? Would you please check if the krakenuniq.output files generated in another server (when you were using Mobaexterm for the Microbial NT) have unix or dos format? Thanks for all the valuable info, this will allow us to improve aMeta!

npsonis commented 8 months ago

This is a printscreen from the default text editor of mobaxterm

image

This is unix (penguin icon), if dos then the windows icon is selected.

npsonis commented 8 months ago

I forgot to mention that in the other server that I used the microbial database the format of the file is unix.

clami66 commented 8 months ago

Hi @npsonis could you provide some more info about the databases you are using:

Could you paste the first few lines of the output of the following command:

less aMeta_resources/DBDIR_KrakenUniq_Full_NT/taxdb

Or check that the extra CR symbol is present in the file above?

Thanks

npsonis commented 8 months ago

Hi @clami66 I downloaded a year ago and I think I did the same as with the microbial one that I downloaded a couple of months ago using the way that you propose (figshare code and wget).

after less aMeta_resources/DBDIR_KrakenUniq_Full_NT/taxDB (not taxdb) I get

0 0 unclassified no rank 1 1 root no rank 2 131567 Bacteria superkingdom 6 335928 Azorhizobium genus 7 6 Azorhizobium caulinodans species 9 32199 Buchnera aphidicola species 10 1706371 Cellvibrio genus 11 1707 Cellulomonas gilvus species 13 203488 Dictyoglomus genus 14 13 Dictyoglomus thermophilum species 16 32011 Methylophilus genus 17 16 Methylophilus methylotrophus species 18 213421 Pelobacter genus 19 18 Pelobacter carbinolicus species 20 76892 Phenylobacterium genus 21 20 Phenylobacterium immobile species 22 267890 Shewanella genus 23 22 Shewanella colwelliana species 24 22 Shewanella putrefaciens species 25 22 Shewanella hanedai species 27 49928 halophilic eubacterium NRCC 41227 species 28 49928 halophilic eubacterium species 29 28221 Myxococcales order 31 80811 Myxococcaceae family 32 31 Myxococcus genus 33 32 Myxococcus fulvus species 34 32 Myxococcus xanthus species 35 83461 Corallococcus macrosporus species 38 47 Archangium disciforme species 39 80811 Archangiaceae family 40 39 Stigmatella genus 41 40 Stigmatella aurantiaca species 42 39 Cystobacter genus 43 42 Cystobacter fuscus species 44 39 Melittangium genus 45 44 Melittangium lichenicola species 47 39 Archangium genus 48 47 Archangium gephyra species 49 80812 Polyangiaceae family 50 49 Chondromyces genus 51 50 Chondromyces apiculatus species 52 50 Chondromyces crocatus species 53 224463 Nannocystis genus 54 53 Nannocystis exedens species 55 49 Polyangium genus 56 39643 Sorangium cellulosum species 57 2649095 Polyangium sp. species 59 481 Vitreoscilla genus 60 2627922 Vitreoscilla sp. species 61 59 Vitreoscilla stercoraria species 62 59 Vitreoscilla beggiatoides species 63 59 Vitreoscilla filiformis species 64 189773 Herpetosiphon genus 65 64 Herpetosiphon aurantiacus species 68 32033 Lysobacter genus 69 68 Lysobacter enzymogenes species

clami66 commented 8 months ago

@npsonis could you do the following:

Download the taxdb file at this link

Overwrite the current taxDB in the full NT database folder with the newly downloaded file

Re-run the analyses that would crash and see if this fixes the issue. If it's too much work to re-run them in their entirety, you could delete the krakenuniq.output* files and re-run on the same output folder to see if they are recreated correctly

npsonis commented 8 months ago

@clami66 I can do it but I am still waiting to finish the analyses that are currently running. Probably next week I will give it a try and let you know.

npsonis commented 8 months ago

@clami66 I confirm that this fixed the issue. From this: % reads taxReads kmers dup cov taxID rank taxName 95.04 1689066 1689066 42342021 1.09 0.1135 0 no rank unclassified 4.958 88109 773 544370 1.39 5.005e-06 1 no rank root 4.634 82352 347 529266 1.34 4.942e-06 131567 no rank cellular organisms 4.543 80732 2040 511959 1.34 1.986e-05 2 superkingdom Bacteria

Now I get this: % reads taxReads kmers dup cov taxID rank taxName 95.04 1689066 1689066 42342021 1.09 0.1135 0 no rank unclassified 4.958 88109 773 544370 1.39 5.005e-06 1 no rank root 4.634 82352 347 529266 1.34 4.942e-06 131567 no rank cellular organisms 4.543 80732 2040 511959 1.34 1.986e-05 2 superkingdom Bacteria

Also the format is UNIX as intended to be.

clami66 commented 7 months ago

Fixed!

LeandroRitter commented 1 week ago

@clami66 the issue with ^M symbol is back, one user reported it. Did you upload the fixed taxDB to the scilifelab figshare?