Closed zpf0117b closed 3 years ago
The log of task gtdbtk using the command bash metaGEM.sh -j 4 -m 32 -c 4 -t gtdbtk -l showed:
Unlocking snakemake ... Unlocking working directory.
Dry-running snakemake jobs ... Building DAG of jobs... Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
[Tue Jul 27 02:55:42 2021] Job 0: WARNING: Be very careful when adding/removing any lines above this message. The metaGEM.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly, therefore adding/removing any lines before this message will likely result in parser malfunction.
Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution. Do you wish to submit this batch of jobs on your local machine? (y/n)y Building DAG of jobs... Using shell: /bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
Select jobs to execute... Building DAG of jobs... Using shell: /bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
Select jobs to execute...
[Tue Jul 27 02:55:44 2021] Job 0: WARNING: Be very careful when adding/removing any lines above this message. The metaGEM.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly, therefore adding/removing any lines before this message will likely result in parser malfunction.
[Tue Jul 27 02:55:44 2021] Finished job 0. 1 of 1 steps (100%) done Complete log: /data/main/metaGEM/.snakemake/log/2021-07-27T025544.096465.snakemake.log
Hi @zpf0117b,
Thanks for bringing this bug to my attention. Indeed, line 1 of the script tries to load the tidyverse R package, which is not actually a dependency in the metagem env. Just FYI, the tidyverse includes dplyr and ggplot2.
I have now fixed this (https://github.com/franciscozorrilla/metaGEM/commit/ee2fdea597742517a03b942dceb08e829fbd9277) so that the ggplot2 + dplyr packages are loaded.
In your case you may also want to simply install the tidyverse package to avoid re-cloning or manually modifying scripts:
source activate metagem
conda install -c r r-tidyverse
Regarding the GTDBTk jobs, it seems like they failed for samples 1 and 3.
Could you share the log files for those jobs? You can find them in the logs/ subfolder:
ll logs/|grep -i gtdb
Hi, @franciscozorrilla ,
Sadly there is nothing in the logs/ subfolder.
The full output message of GTDBTk jobs is
Setting current directory to root in config.yaml file ...
Parsing Snakefile to target rule: gtdbtk ...
Do you wish to continue with these parameters? (y/n)y Proceeding with gtdbtk job(s) ...
Please verify parameters set in the config.yaml file:
path: root: /data/main/metaGEM scratch: /tmp folder: data: dataset logs: logs assemblies: assemblies scripts: scripts crossMap: crossMap concoct: concoct maxbin: maxbin metabat: metabat refined: refined_bins reassembled: reassembled_bins classification: GTDBTk abundance: abundance GRiD: GRiD GEMs: GEMs SMETANA: SMETANA memote: memote qfiltered: qfiltered stats: stats proteinBins: protein_bins dnaBins: dna_bins pangenome: pangenome kallisto: kallisto kallistoIndex: kallistoIndex benchmarks: benchmarks scripts: kallisto2concoct: kallisto2concoct.py prepRoary: prepareRoaryInput.R binFilter: binFilter.py qfilterVis: qfilterVis.R assemblyVis: assemblyVis.R binningVis: binningVis.R modelVis: modelVis.R compositionVis: compositionVis.R taxonomyVis: taxonomyVis.R carveme: media_db.tsv toy: download_toydata.txt GTDBtkVis: cores: fastp: 8 megahit: 12 crossMap: 12 concoct: 12 metabat: 12 maxbin: 12 refine: 12 reassemble: 12 classify: 2 gtdbtk: 12 abundance: 12 carveme: 4 smetana: 12 memote: 4 grid: 12 prokka: 2 roary: 12 params: cutfasta: 10000 assemblyPreset: meta-sensitive assemblyMin: 1000 concoct: 800 metabatMin: 50000 seed: 420 minBin: 1500 refineMem: 1600 refineComp: 50 refineCont: 10 reassembleMem: 1600 reassembleComp: 50 reassembleCont: 10 carveMedia: M8 smetanaMedia: M1,M2,M3,M4,M5,M7,M8,M9,M10,M11,M13,M14,M15A,M15B,M16 smetanaSolver: CPLEX roaryI: 90 roaryCD: 90 envs: metagem: metagem metawrap: metawrap prokkaroary: prokkaroary
Please pay close attention to make sure that your paths are properly configured! Do you wish to proceed with this config.yaml file? (y/n)y
Unlocking snakemake ... Unlocking working directory.
Dry-running snakemake jobs ... Building DAG of jobs... Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
[Wed Jul 28 00:27:56 2021] Job 0: WARNING: Be very careful when adding/removing any lines above this message. The metaGEM.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly, therefore adding/removing any lines before this message will likely result in parser malfunction.
Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution. Do you wish to submit this batch of jobs on your local machine? (y/n)y Building DAG of jobs... Using shell: /bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
Select jobs to execute...
[Wed Jul 28 00:27:58 2021] Job 0: WARNING: Be very careful when adding/removing any lines above this message. The metaGEM.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly, therefore adding/removing any lines before this message will likely result in parser malfunction.
Gathering /data/main/metaGEM/GTDBTk/sample1 /data/main/metaGEM/GTDBTk/sample2 /data/main/metaGEM/GTDBTk/sample3 ... [Wed Jul 28 00:27:58 2021] Finished job 0. 1 of 1 steps (100%) done Complete log: /data/main/metaGEM/.snakemake/log/2021-07-28T002758.193068.snakemake.log
This message shows the log is stored in the /data/main/metaGEM/.snakemake/log/2021-07-28T002758.193068.snakemake.log file, and here is the content of this file:
Building DAG of jobs... Using shell: /bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job stats: job count min threads max threads
all 1 1 1 total 1 1 1
Select jobs to execute...
[Wed Jul 28 00:27:58 2021] Job 0: WARNING: Be very careful when adding/removing any lines above this message. The metaGEM.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly, therefore adding/removing any lines before this message will likely result in parser malfunction.
[Wed Jul 28 00:27:58 2021] Finished job 0. 1 of 1 steps (100%) done Complete log: /data/main/metaGEM/.snakemake/log/2021-07-28T002758.193068.snakemake.log
OK, I think I see what's going on. You are trying to run GTDBTk locally instead of submitting to the cluster scheduler, which is why you don't have any files in the logs/ folder. GTDBTk requires ~204 GB of RAM to run successfully, so that is likely why your jobs are failing. You could try adding the --scratch_dir <dir> flag to the GTDBTk call in the Snakefile on this line:
However, I would recommend keeping the Snakefile as is and submitting GTDBTk jobs to the cluster instead of running locally.
Also note that when running locally, only one job will be submitted at a time, and you do not need to specify the number of jobs (-j), cores (-c), or RAM (-m), as the last two parameters are only used for cluster job submissions. To modify the number of cores used by the jobs locally you need to modify the appropriate fields in the config.yaml file. Local jobs will use all RAM available to them.
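Since local jobs simply use whatever RAM the machine has, a quick check before launching GTDBTk locally might save a failed run. The following is only a minimal sketch: the 204 GB threshold is the approximate figure mentioned above, and the function name has_enough_ram is made up for this example (it is not part of metaGEM).

```shell
# Sketch: warn before running GTDBTk locally on a machine with too little RAM.
# REQUIRED_GB is the approximate figure quoted in this thread, not a hard limit.
REQUIRED_GB=204

# Succeeds (exit status 0) when avail_kb is at least need_gb gigabytes.
# has_enough_ram is a hypothetical helper, not a metaGEM function.
has_enough_ram() {
    local need_gb=$1 avail_kb=$2
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# On Linux, /proc/meminfo reports MemTotal in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    total_kb=${total_kb:-0}
    if has_enough_ram "$REQUIRED_GB" "$total_kb"; then
        echo "OK: $(( total_kb / 1024 / 1024 )) GB total RAM"
    else
        echo "WARNING: $(( total_kb / 1024 / 1024 )) GB total RAM; GTDBTk may need ~${REQUIRED_GB} GB"
    fi
fi
```

This only compares total memory against the threshold; it does not account for memory already in use by other processes.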
Let me know if this helps or if you have additional questions!
Hi, @franciscozorrilla ,
It seems GTDBTk now runs smoothly. Here are the results in the GTDBTk folder:
> ls /data/main/metaGEM/GTDBTk/sample1
align gtdbtk.ar122.markers_summary.tsv gtdbtk.bac120.filtered.tsv gtdbtk.bac120.msa.fasta gtdbtk.bac120.user_msa.fasta gtdbtk.log gtdbtk.warnings.log classify gtdbtk.bac120.classify.tree gtdbtk.bac120.markers_summary.tsv gtdbtk.bac120.summary.tsv gtdbtk.failed_genomes.tsv gtdbtk.translation_table_summary.tsv identify
> ls /data/main/metaGEM/GTDBTk/sample2
align gtdbtk.ar122.filtered.tsv gtdbtk.ar122.summary.tsv gtdbtk.bac120.filtered.tsv gtdbtk.bac120.summary.tsv gtdbtk.log identify classify gtdbtk.ar122.markers_summary.tsv gtdbtk.ar122.user_msa.fasta gtdbtk.bac120.markers_summary.tsv gtdbtk.bac120.user_msa.fasta gtdbtk.translation_table_summary.tsv gtdbtk.ar122.classify.tree gtdbtk.ar122.msa.fasta gtdbtk.bac120.classify.tree gtdbtk.bac120.msa.fasta gtdbtk.failed_genomes.tsv gtdbtk.warnings.log
> ls /data/main/metaGEM/GTDBTk/sample3
align gtdbtk.ar122.markers_summary.tsv gtdbtk.bac120.filtered.tsv gtdbtk.bac120.msa.fasta gtdbtk.bac120.user_msa.fasta gtdbtk.log gtdbtk.warnings.log classify gtdbtk.bac120.classify.tree gtdbtk.bac120.markers_summary.tsv gtdbtk.bac120.summary.tsv gtdbtk.failed_genomes.tsv gtdbtk.translation_table_summary.tsv identify
> ls /data/main/metaGEM/GTDBTk/sample1/classify/
gtdbtk.bac120.classify.tree gtdbtk.bac120.summary.tsv intermediate_results
> ls /data/main/metaGEM/GTDBTk/sample2/classify/
gtdbtk.ar122.classify.tree gtdbtk.ar122.summary.tsv gtdbtk.bac120.classify.tree gtdbtk.bac120.summary.tsv intermediate_results
> ls /data/main/metaGEM/GTDBTk/sample3/classify/
gtdbtk.bac120.classify.tree gtdbtk.bac120.summary.tsv intermediate_results
However, the compositionVis job produced another error after I renamed the file GTDBTk.stats to GTDBtk.stats in order to fix the error:
In file(file, "rt"): cannot open file 'GTDBtk.stats': No such file or directory
and to match the input expected by compositionVis.R:
taxonomy=read.delim("GTDBtk.stats",header=TRUE) %>%
The error message shows:
During startup - Warning message: Setting LC_CTYPE failed, using "C" Warning message: package 'ggplot2' was built under R version 4.0.5
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Warning message: package 'dplyr' was built under R version 4.0.5 Error in separate(., classification, into = c("kingdom", "phylum", "class", : could not find function "separate" Calls: %>% Execution halted [Wed Jul 28 16:41:41 2021] Error in rule compositionVis: ...
And here are the first two lines of the file GTDBtk.stats (renamed from GTDBTk.stats), in case it helps:
user_genome classification fastani_reference fastani_reference_radius fastani_taxonomy fastani_ani fastani_af closest_placement_reference closest_placement_radius closest_placement_taxonomy closest_placement_ani closest_placement_af pplacer_taxonomy classification_method note other_related_references(genome_id,species_name,radius,ANI,AF) msa_percent translation_table red_value warnings
bin.1.orig d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Coprococcus;s__Coprococcus eutactus_A GCF_001404675.1 95.0 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Coprococcus;s__Coprococcus eutactus_A 98.92 0.97 GCF_001404675.1 95.0 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Coprococcus;s__Coprococcus eutactus_A 98.92 0.97 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Coprococcus;s__ taxonomic classification defined by topology and ANI topological placement and ANI have congruent species assignments GCA_900767685.1, s__Coprococcus sp900767685, 95.0, 89.43, 0.61; GCF_000154425.1, s__Coprococcus eutactus, 95.0, 89.18, 0.83; GCA_900557435.1, s__Coprococcus sp900557435, 95.0, 88.75, 0.64; GCA_900548215.1, s__Coprococcus sp900548215, 95.0, 88.31, 0.68; GCA_900548315.1, s__Coprococcus sp900548315, 95.0, 88.14, 0.76; GCF_003482105.1, s__Coprococcus sp000433075, 95.0, 82.19, 0.27; GCF_003461625.1, s__Coprococcus sp900066115, 95.0, 80.55, 0.22; GCA_002437435.1, s__Coprococcus sp002437435, 95.0, 80.05, 0.2; GCF_000154245.1, s__Coprococcus sp000154245, 95.0, 80.01, 0.27; GCA_900761435.1, s__Coprococcus sp900761435, 95.0, 77.48, 0.1 47.47 11 N/A N/A
Glad the GTDBTk jobs ran successfully and the results are now present.
Thank you for highlighting this additional bug. Indeed, the script should be loading the file GTDBTk.stats instead of GTDBtk.stats; this is now fixed in the latest commit (https://github.com/franciscozorrilla/metaGEM/commit/35eff4cb04b309068898e4c9ac91bd5dade54de4).
Regarding your last error, it appears the separate() function is in the tidyr package, which is part of the tidyverse. I believe your problems should be solved by installing either:
conda install -c r r-tidyr
or
conda install -c r r-tidyverse
I will update the metaGEM recipe file to include either the tidyverse or tidyr packages.
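For anyone who wants to inspect the taxonomy without R, the kind of split that separate() performs on the classification column can be approximated in the shell. This is a rough sketch using awk rather than tidyr, so it is not what compositionVis.R itself does; the function name split_ranks is invented for illustration, and it assumes the seven semicolon-delimited GTDB ranks (domain through species) seen in the GTDBTk.stats excerpt above.

```shell
# Sketch: split one GTDB classification string into seven tab-separated ranks.
# split_ranks is a hypothetical helper; missing ranks are left as empty fields.
split_ranks() {
    printf '%s\n' "$1" | awk -F';' '{
        for (i = 1; i <= 7; i++) {
            v = (i <= NF) ? $i : ""                  # pad when a rank is absent
            printf "%s%s", v, (i < 7 ? "\t" : "\n")
        }
    }'
}

# Example: print only the class-level field (third rank) of one classification.
split_ranks 'd__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Coprococcus;s__Coprococcus eutactus_A' | cut -f3
```

The rank prefixes (d__, p__, and so on) are kept as they are, since the thread does not say whether the R script strips them.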
Hi, @franciscozorrilla, it appears the file compositionVis.R needs another package, tidytext (see the discussion in https://twitter.com/juliasilge/status/1077606510551683072 and the documentation in https://cran.r-project.org/web/packages/tidytext/tidytext.pdf), which provides the function scale_x_reordered(). We can install this package by running:
conda install -c r r-tidytext
With this, the compositionVis task finished successfully.
I have now replaced dplyr and ggplot2 with tidyverse + added tidytext in both the metagem_env.yml conda recipe file and the compositionVis.R script.
Thanks a lot for reporting these bugs! If everything is working smoothly now I will close the issue, feel free to reopen if anything else comes up.
Hi Francisco,
I have a question regarding the composition: in my case, >90% of MAGs are not assigned at the genus/species rank, so I am unable to generate the plots using compositionVis.R.
When I tried to make the classical abundance table by combining abundance.stats and GTDBTk.stats, I found that GTDBTk.stats is missing the sample ID of each MAG (I have attached the text files for your reference). Could you please suggest a fix to generate a taxonomy+abundance table?
Thank you. Kunal GTDBTk_stats.txt abundance_stats.txt
Hi Kunal,
You can simply modify the R script to show e.g. class or genus level taxonomic assignments instead by replacing the species term in the following command.
You should also remove the filter step at the start, e.g. for class level taxonomic assignments, modify the above two lines as follows:
ggplot(taxab) +
geom_bar(aes(x=reorder_within(class,-rel_ab,sample),y=rel_ab*100),stat="identity") +
The plot currently fails to visualize abundances because the abundance + taxonomy files cannot be merged. The two files that you have provided seem to correspond to different sets of MAGs: for example, the abundance table has 17 MAGs, while the taxonomy table has 18 MAGs. One has bin IDs such as bin.40.orig, while the other has IDs such as M.bin.1.o. I am not really sure how that happened; maybe the files from different analyses got mixed up?
To fix this issue, please use the correct abundance and taxonomy files.
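A quick way to see which bin IDs fail to match between the two tables is to compare their first columns directly. The sketch below assumes both files are tab-delimited, carry one header line, and keep the bin ID in the first column; that holds for the GTDBTk.stats excerpt earlier in this thread but is only an assumption for abundance.stats. The helper names bin_ids and ids_only_in_first are made up for this example.

```shell
# Sketch: list bin IDs present in one tab-delimited table but not the other.
# Assumes one header line and the bin ID in column 1 (hypothetical for abundance.stats).

# Print the sorted, unique first-column values of a table, skipping the header.
bin_ids() {
    tail -n +2 "$1" | cut -f1 | sort -u
}

# Print IDs found in the first table but missing from the second.
ids_only_in_first() {
    local a b
    a=$(mktemp); b=$(mktemp)
    bin_ids "$1" > "$a"
    bin_ids "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Usage, run from the folder holding both files:
if [ -f GTDBTk.stats ] && [ -f abundance.stats ]; then
    echo "Only in GTDBTk.stats:";    ids_only_in_first GTDBTk.stats abundance.stats
    echo "Only in abundance.stats:"; ids_only_in_first abundance.stats GTDBTk.stats
fi
```

If every ID shows up in both lists, the two tables use different naming schemes (for example, with and without the sample prefix) rather than different sets of MAGs.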
Best wishes, Francisco
Hi!
Thank you for your reply. Actually, the files are from the same run (I had provided a few lines from each file as representative). As you mentioned, in one of the files (abundance.stats) the IDs start with the sample name "M"; however, in the corresponding GTDBTk.stats file the bin IDs don't have the sample IDs. Do you think I have messed up somewhere?
Thanks a lot. Kunal
Yes, something has gone wrong. Could you provide more details regarding the steps that you followed?
If you followed the general workflow outlined in the tutorial, the sample IDs should be encoded in the MAG filename. Specifically, in section 5 you can see how to run the extractDnaBins rule. For example, let's say your sample ID is ERR260137, then your bin IDs should automatically be named e.g. ERR260137_bin.1.o.fa.
Here you can see the underlying code, essentially it just copies and renames the bins from the metawrap bin reassembly output.
Yes, I have followed the steps described in the tutorial with -l, because I am running it on a cloud instance. All my files are perfectly labeled with the sample initials, e.g. "M", and also segregated into the respective folders (as can be seen from the screenshot of the GEMs and protein bin folders).
I see, those are your sample IDs. Could you also check the contents of your dna_bins folder?
I would suggest trying to re-generate the abundance.stats file as well as the GTDBTk.stats file.
Please run the compositionVis rule again to re-generate the files, or alternatively run the code to manually re-generate each file:
The MAG IDs should follow the pattern {sampleID}{bin_ID}.fa
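To check the naming quickly, something like the following could flag bins whose filename does not start with the sample ID. It is only a sketch: it assumes a layout of dna_bins/<sampleID>/<file>.fa, which is a guess based on the per-sample folders described in this thread, and misnamed_bins is a made-up helper name.

```shell
# Sketch: print every .fa file whose name does not start with its parent
# folder's name. Assumes dna_bins/<sampleID>/<file>.fa (hypothetical layout).
misnamed_bins() {
    local root=$1 f sample
    for f in "$root"/*/*.fa; do
        [ -e "$f" ] || continue              # glob matched nothing
        sample=$(basename "$(dirname "$f")")
        case "$(basename "$f")" in
            "$sample"*) ;;                   # prefixed with the sample ID: fine
            *) echo "$f" ;;                  # report the mismatch
        esac
    done
}

# Usage, run from the metaGEM root folder:
misnamed_bins dna_bins
```

No output means every bin file carries its sample ID as a prefix.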
Yes, even the IDs in dna_bins look fine. Thanks for your inputs, I will rerun compositionVis.
Thanks a lot. Kunal
Hi Francisco,
I ran the compositionVis task on the toy dataset you provide with the command bash metaGEM.sh -j 8 -m 32 -c 24 -h 20 -t compositionVis -l, but got the following error message.
It seems this error has two causes.
The files in the folder GTDBTk/sample2/classify/ look OK. Maybe something went wrong in the gtdbtk task?
Could you tell me how to fix the error?