Trouble with checkm tree and understanding output

UCSBMicrobiologyCore commented 8 years ago

Hi Dr Parks;

I'm having two separate issues:

First, I ran checkm tree as follows:

checkm tree --ali --nt -x fasta /home/tarn/archaeabins /home/tarn/archaeatree

The input folder contained a whole bunch of different NCBI genomes and four of my bins, whose taxonomy I wanted to determine. The entire thing ran to completion, but when I look at the output, it's clear that it read the first bin and did not complete the rest. That's my first issue.

I'm hoping that was a memory issue (as I type this I'm going to try reduced tree to see if it helps). That said, there was a tree file created, listed as concatenated.tre. I assumed that was the output tree, but when I opened it in a tree-editing program, it showed a small tree of 5 IMG genomes. I tried running the tree command with a few other subsets of genomes and got the same tree as a result, and those runs also proceeded to completion, i.e.:

Identifying marker genes in 4 bins with 1 threads: Finished processing 4 of 4 (100.00%) bins. Saving HMM info to file.

Calculating genome statistics for 4 bins with 1 threads: Finished processing 4 of 4 (100.00%) bins.

Extracting marker genes to align. Parsing HMM hits to marker genes: Finished parsing hits for 4 of 4 (100.00%) bins. Extracting 43 HMMs with 1 threads: Finished extracting 43 of 43 (100.00%) HMMs. Aligning 43 marker genes with 1 threads: Finished aligning 43 of 43 (100.00%) marker genes.

Reading marker alignment files. Concatenating alignments. Placing 4 bins into the genome tree with pplacer (be patient).

...my second issue is--what is this tree and why was it made with these specific genomes?

thank you

Jon

UCSBMicrobiologyCore commented 8 years ago

I apparently missed some of the output for the first issue. Here's what I got when I reran:

[CheckM - tree] Placing bins in reference genome tree.

Identifying marker genes in 115 bins with 1 threads: [Error] Input file does not exists: /home/tarn/archaeatree/bins/Aciduliprofundum boonei T469/genes.gff.4

Saving HMM info to file.

Calculating genome statistics for 115 bins with 1 threads: Finished processing 115 of 115 (100.00%) bins.

Extracting marker genes to align. Parsing HMM hits to marker genes: Finished parsing hits for 1 of 1 (100.00%) bins. Extracting 43 HMMs with 1 threads: Finished extracting 43 of 43 (100.00%) HMMs. Aligning 43 marker genes with 1 threads: Finished aligning 43 of 43 (100.00%) marker genes.

Reading marker alignment files. Concatenating alignments. Placing 115 bins into the genome tree with pplacer (be patient).

{ Current stage: 0:04:45.100 || Total: 0:04:45.100 }

It is telling me that something does not exist for my file '/home/tarn/archaeatree/bins/Aciduliprofundum boonei.fasta'. CheckM then presumably just stops looking for marker genes in the rest of the bins, which explains the output issues I was having, but not why it does this.
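For anyone comparing their own output against this, the truncation can be spotted by checking whether the number of bins that reach the parsing step matches the number CheckM set out to process. The sketch below is only illustrative: the function name and the regular expressions are assumptions based on the console lines quoted in this thread, not part of CheckM.

```python
import re

def find_truncation(log_text):
    """Return (error_lines, expected_bins, parsed_bins) from saved CheckM console output.

    The patterns are assumptions modelled on the log lines quoted in this thread.
    """
    errors = re.findall(r"\[Error\][^\n]*", log_text)
    expected = [int(n) for n in re.findall(r"Identifying marker genes in (\d+) bins", log_text)]
    parsed = [int(total) for _, total in re.findall(r"Finished parsing hits for (\d+) of (\d+)", log_text)]
    # Largest count CheckM started with vs. smallest count that reached parsing.
    return errors, max(expected, default=0), min(parsed, default=0)

log = (
    "Identifying marker genes in 115 bins with 1 threads: "
    "[Error] Input file does not exists: "
    "/home/tarn/archaeatree/bins/Aciduliprofundum boonei T469/genes.gff.4\n"
    "Parsing HMM hits to marker genes: Finished parsing hits for 1 of 1 (100.00%) bins.\n"
)

errors, expected, parsed = find_truncation(log)
if errors or parsed < expected:
    print(f"{len(errors)} error(s); {parsed} of {expected} bins reached the parsing step")
```

A mismatch such as 1 of 115 suggests the run should not be trusted even though every stage reports 100.00%.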

Is there some other information I need to provide? The file I input for this specific bin was simply the entire genome pasted into a FASTA file. According to bin_stats.tree.tsv, CheckM can read and understand this format, so I'm not sure what the issue might be.

Jon

donovan-h-parks commented 8 years ago

Hello Jon,

The output from CheckM you have sent is a bit confusing. Your first message indicates CheckM is processing 4 bins, while your second message indicates CheckM is processing 115 bins. CheckM never filters out genomes so the number of bins being processed should always be consistent.

CheckM identifies which bins (genomes) to process in a directory based on file extension. So, if you have a mix of .fna and .fa bins, only one set of these will be processed. By default, CheckM looks for *.fna, though this can be changed with the -x flag. The rationale for this is that people often also have protein files (or other files) in the same directory as their bins, so CheckM needs to know what you want to process.
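To see which files a given extension would pick up, a quick tally of the bin directory can help. This is a minimal sketch, not CheckM code: the helper names are hypothetical, and it assumes selection is a plain match on the trailing ".<extension>" of each file name.

```python
import tempfile
from collections import Counter
from pathlib import Path

def extension_counts(bin_dir):
    """Count regular files in bin_dir by extension (leading dot removed)."""
    return Counter(p.suffix.lstrip(".") for p in Path(bin_dir).iterdir() if p.is_file())

def selected_bins(bin_dir, extension="fna"):
    """List files ending in .<extension>; assumed to mirror what -x would select."""
    return sorted(
        p.name for p in Path(bin_dir).iterdir()
        if p.is_file() and p.name.endswith("." + extension)
    )

# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("bin1.fna", "bin2.fna", "bin3.fa", "proteins.faa"):
        (Path(demo_dir) / name).touch()
    print(extension_counts(demo_dir))
    print(selected_bins(demo_dir, "fna"))
```

If the tally shows more than one nucleotide extension, renaming the files to a single extension before running CheckM avoids silently skipping some bins.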

The canonical output of CheckM is a table of completeness and contamination estimates. The tree used by CheckM is only a guide tree for establishing what set of marker genes should be used to evaluate each genome. I do not recommend using CheckM as a tool for inferring trees as this is not the primary goal or intention of the software.

Please let me know if you have any other questions. You can find information about CheckM and the different output tables it can produce on the CheckM wiki: https://github.com/Ecogenomics/CheckM/wiki

Cheers, Donovan

UCSBMicrobiologyCore commented 8 years ago

Hi Dr. Parks, thanks for the quick response. Sorry, I went through the wiki prior to running the command, but my English is not the best, so I may have misconstrued what I was reading. I’ll try my best to explain my train of thought.

When I read the description of the tree command, I read this: “Place bins in the reference genome tree.”, and assumed it meant that my bins would be placed in a tree with other reference genomes from CheckM. So I ran it with -x fasta.

What I got instead from storage>tree>concatenated2.tre was this (sorry for the imgur..I couldn't figure out a way to post images in this box):

http://imgur.com/07oAko3

--these I guess are IMG genomes and are some of the reference genomes provided by checkm, but none of my bins were there, so I assumed this was the incorrect output, and maybe my tree command screwed up somewhere. This was further corroborated by what I saw when I opened up bin_stats.tree.tsv for that run:

http://imgur.com/LqI66O6

which seems to show that my first bin was processed, but something was wrong with the others, as shown by the fact that there are -1 genes predicted and there is no output in the “bins” file associated with this run.

...so then I ran checkm tree again with a subset of four bins from the above run that had worked in the past. When I opened the tree file, it was the exact same.

I guess my two questions are:

1. Is that the correct tree that is supposed to be produced by the checkm tree command?
2. What is going wrong with the full 115-bin checkm tree run?

donovan-h-parks commented 8 years ago

Hello,

I'm not fully following what is happening. All 115 bins should end up in the tree. Can you verify that all your genomes have the same file extension (e.g. *.fna)?

Can you try running the "lineage_wf" command? If you can send me the exact command you ran and the full output, I should be better able to determine what is happening.

Cheers, Donovan

UCSBMicrobiologyCore commented 8 years ago

sorry for the wall of text:

[tarn@knot ~]$ checkm lineage_wf -x fasta archaeabins archaealineage

[CheckM - tree] Placing bins in reference genome tree.

Identifying marker genes in 115 bins with 1 threads: [Error] Input file does not exists: archaealineage/bins/Aciduliprofundum boonei T469/genes.gff.4

Saving HMM info to file.

Calculating genome statistics for 115 bins with 1 threads: Finished processing 115 of 115 (100.00%) bins.

Extracting marker genes to align. Parsing HMM hits to marker genes: Finished parsing hits for 1 of 1 (100.00%) bins. Extracting 43 HMMs with 1 threads: Finished extracting 43 of 43 (100.00%) HMMs. Aligning 43 marker genes with 1 threads: Finished aligning 43 of 43 (100.00%) marker genes.

Reading marker alignment files. Concatenating alignments. Placing 115 bins into the genome tree with pplacer (be patient).

{ Current stage: 0:16:09.408 || Total: 0:16:09.408 }

[CheckM - lineage_set] Inferring lineage-specific marker sets.

Reading HMM info from file. Parsing HMM hits to marker genes: Finished parsing hits for 1 of 1 (100.00%) bins.

Determining marker sets for each genome bin.
Finished processing 1 of 115 (0.87%) bins (current: Hadesarchaea_archaeon_YN
Finished processing 2 of 115 (1.74%) bins (current: Aigarchaeota archaeon SC
Finished processing 3 of 115 (2.61%) bins (current: Bathyarchaeota_SMTZ1-55)
Finished processing 4 of 115 (3.48%) bins (current: Methanomethylophilus alv
Finished processing 5 of 115 (4.35%) bins (current: Micrarchaeota archaeonY
Finished processing 6 of 115 (5.22%) bins (current: Methanoperedens nitrored
Finished processing 7 of 115 (6.09%) bins (current: Halobacterium salinarum)
Finished processing 8 of 115 (6.96%) bins (current: Caldivirga maquilingensi
Finished processing 9 of 115 (7.83%) bins (current: Methanococcus maripaludi
Finished processing 10 of 115 (8.70%) bins (current: Nanoarchaeum equitans K
Finished processing 11 of 115 (9.57%) bins (current: Ignisphaera aggregans D
Finished processing 12 of 115 (10.43%) bins (current: Micrarchaeum acidiphil
Finished processing 13 of 115 (11.30%) bins (current: Nitrososphaera viennen
Finished processing 14 of 115 (12.17%) bins (current: Thaumarchaeota archaeo
Finished processing 15 of 115 (13.04%) bins (current: Nanopusillus sp. Nst1
Finished processing 16 of 115 (13.91%) bins (current: Thermoproteus uzoniens
Finished processing 17 of 115 (14.78%) bins (current: Candidatus Methanospha
Finished processing 18 of 115 (15.65%) bins (current: Lokiarchaeum sp. GC14
Finished processing 19 of 115 (16.52%) bins (current: Vulcanisaeta distribut
Finished processing 20 of 115 (17.39%) bins (current: Candidatus Parvarchaeu
Finished processing 21 of 115 (18.26%) bins (current: Methanospirillum hunga
Finished processing 22 of 115 (19.13%) bins (current: Acidilobus_saccharovor
Finished processing 23 of 115 (20.00%) bins (current: Methanoplanus petrolea
Finished processing 24 of 115 (20.87%) bins (current: Korarchaeum cryptofilu
Finished processing 25 of 115 (21.74%) bins (current: Fervidicoccus fontis K
Finished processing 26 of 115 (22.61%) bins (current: Candidatus Bathyarchae
Finished processing 28 of 115 (24.35%) bins (current: Haloredivivus sp. G17)
Finished processing 29 of 115 (25.22%) bins (current: Thermoproteales_YNP_Si
Finished processing 30 of 115 (26.09%) bins (current: Woesearchaeota_AR4_gwa
Finished processing 31 of 115 (26.96%) bins (current: Methermicoccus shengli
Finished processing 32 of 115 (27.83%) bins (current: Pyrococcus_furiosus_DS
Finished processing 33 of 115 (28.70%) bins (current: Marine Group III eurya
Finished processing 34 of 115 (29.57%) bins (current: Methanobrevibacter smi
Finished processing 35 of 115 (30.43%) bins (current: Thermoplasma acidophil
Finished processing 36 of 115 (31.30%) bins (current: Ferroglobus placidus D
Finished processing 37 of 115 (32.17%) bins (current: Pacearchaeota_RBG_16_P
Finished processing 38 of 115 (33.04%) bins (current: Methanoflorens_stordal
Finished processing 39 of 115 (33.91%) bins (current: Sulfolobus solfataricu
Finished processing 40 of 115 (34.78%) bins (current: Geoarchaeota archaeon
Finished processing 41 of 115 (35.65%) bins (current: Aenigmarchaeum subterr
Finished processing 42 of 115 (36.52%) bins (current: Iainarchaeum andersoni
Finished processing 43 of 115 (37.39%) bins (current: RBG_13_Euryarchaeota_3
Finished processing 44 of 115 (38.26%) bins (current: Candidatus Nitrosopumi
Finished processing 45 of 115 (39.13%) bins (current: Methanomethylovorans h
Finished processing 46 of 115 (40.00%) bins (current: YNPFFA_archaeon_SCGC_A
Finished processing 47 of 115 (40.87%) bins (current: Aigarchaeota archaeon
Finished processing 48 of 115 (41.74%) bins (current: Caldisphaera lagunensi
Finished processing 50 of 115 (43.48%) bins (current: Methanocorpusculum lab
Finished processing 51 of 115 (44.35%) bins (current: Micrarchaeota archaeon
Finished processing 52 of 115 (45.22%) bins (current: Methanosaeta thermophi
Finished processing 53 of 115 (46.09%) bins (current: Nanosalinarum sp. J07A
Finished processing 54 of 115 (46.96%) bins (current: Aigarchaeota archaeon
Finished processing 55 of 115 (47.83%) bins (current: Aigarchaeota archaeon
Finished processing 56 of 115 (48.70%) bins (current: Aigarchaeota archaeon
Finished processing 57 of 115 (49.57%) bins (current: Sulfolobales archaeon
Finished processing 58 of 115 (50.43%) bins (current: Micrarchaeota archaeon
Finished processing 59 of 115 (51.30%) bins (current: Woesearchaeota_AR15_gw
Finished processing 60 of 115 (52.17%) bins (current: Palaeococcus_pacificus
Finished processing 62 of 115 (53.91%) bins (current: Candidatus Parvarchaeu
Finished processing 63 of 115 (54.78%) bins (current: Methanothermus fervidu
Finished processing 64 of 115 (55.65%) bins (current: Micrarchaeota archaeon
Finished processing 65 of 115 (56.52%) bins (current: Parvarchaeota archaeon
Finished processing 66 of 115 (57.39%) bins (current: Woesearchaeota_AR11_gw
Finished processing 67 of 115 (58.26%) bins (current: Woesearchaeota_AR17gw
Finished processing 68 of 115 (59.13%) bins (current: Methanotorris igneus K
Finished processing 69 of 115 (60.00%) bins (current: Micrarchaeota archaeon
Finished processing 70 of 115 (60.87%) bins (current: Methanosarcina barkeri
Finished processing 71 of 115 (61.74%) bins (current: Micrarchaeota archaeon
Finished processing 72 of 115 (62.61%) bins (current: Micrarchaeota archaeon
Finished processing 74 of 115 (64.35%) bins (current: Halophilic archaeon DL
Finished processing 76 of 115 (66.09%) bins (current: Halobiforma nitratired
Finished processing 77 of 115 (66.96%) bins (current: Methanothermococcus ok
Finished processing 78 of 115 (67.83%) bins (current: Diapherotrites GW2011
Finished processing 79 of 115 (68.70%) bins (current: Candidatus Nanosalina
Finished processing 80 of 115 (69.57%) bins (current: Thaumarchaeota archaeo
Finished processing 81 of 115 (70.43%) bins (current: Archaea_Hadesarchaea_a
Finished processing 82 of 115 (71.30%) bins (current: Micrarchaeota archaeon
Finished processing 83 of 115 (72.17%) bins (current: Aigarchaeota archaeon
Finished processing 84 of 115 (73.04%) bins (current: Candidatus Methanoregu
Finished processing 85 of 115 (73.91%) bins (current: Aigarchaeota archaeon
Finished processing 86 of 115 (74.78%) bins (current: Methanococcoides burto
Finished processing 87 of 115 (75.65%) bins (current: Methanopyrus kandleri
Finished processing 88 of 115 (76.52%) bins (current: Woesearchaeota_RBG13
Finished processing 89 of 115 (77.39%) bins (current: Thorarchaeota archaeon
Finished processing 90 of 115 (78.26%) bins (current: Ferroplasma acidarmanu
Finished processing 91 of 115 (79.13%) bins (current: Archaeoglobus fulgidus
Finished processing 92 of 115 (80.00%) bins (current: Nanoarchaeota archaeon
Finished processing 93 of 115 (80.87%) bins (current: Bathyarchaeota_group-6
Finished processing 94 of 115 (81.74%) bins (current: Pacearchaeota_AR13_gwb
Finished processing 95 of 115 (82.61%) bins (current: Aenigmarchaeota AR5 gw
Finished processing 96 of 115 (83.48%) bins (current: Haladaptatus paucihalo
Finished processing 98 of 115 (85.22%) bins (current: Aciduliprofundum boone
Finished processing 99 of 115 (86.09%) bins (current: Pacearchaeota_RBG_13_P
Finished processing 100 of 115 (86.96%) bins (current: Halorhabdus utahensis
Finished processing 101 of 115 (87.83%) bins (current: Pacearchaeota_RBG13
Finished processing 102 of 115 (88.70%) bins (current: Methanomassiliicoccus
Finished processing 103 of 115 (89.57%) bins (current: Picrophilus torridus
Finished processing 104 of 115 (90.43%) bins (current: YNP_Site_19_Thermopro
Finished processing 105 of 115 (91.30%) bins (current: Diapherotrites archae
Finished processing 106 of 115 (92.17%) bins (current: EuryarchaeotaMarine
Finished processing 107 of 115 (93.04%) bins (current: Micrarchaeota archaeo
Finished processing 108 of 115 (93.91%) bins (current: Methanocella paludico
Finished processing 109 of 115 (94.78%) bins (current: Cenarchaeum symbiosum
Finished processing 110 of 115 (95.65%) bins (current: Aigarchaeota archaeon
Finished processing 111 of 115 (96.52%) bins (current: Pacearchaeota_AR1_gwc
Finished processing 112 of 115 (97.39%) bins (current: Candidatus Parvarchae
Finished processing 113 of 115 (98.26%) bins (current: Nanoarchaeota archaeo
Finished processing 114 of 115 (99.13%) bins (current: Methanocaldococcus ja
Finished processing 115 of 115 (100.00%) bins (current: Aigarchaeota archaeon SCGC AAA471-I13).

Marker set written to: archaealineage/lineage.ms

{ Current stage: 0:01:16.010 || Total: 0:17:25.418 }

[CheckM - analyze] Identifying marker genes in bins.

Identifying marker genes in 115 bins with 1 threads: [Error] Input file does not exists: archaealineage/bins/Aciduliprofundum boonei T469/genes.gff.4

Saving HMM info to file.

{ Current stage: 0:19:25.826 || Total: 0:36:51.244 }

Parsing HMM hits to marker genes: Finished parsing hits for 1 of 1 (100.00%) bins. Aligning marker genes with multiple hits in a single bin: Finished processing 1 of 1 (100.00%) bins.

{ Current stage: 0:00:10.069 || Total: 0:37:01.313 }

Calculating genome statistics for 115 bins with 1 threads: Finished processing 115 of 115 (100.00%) bins.

{ Current stage: 0:00:26.151 || Total: 0:37:27.465 }

[CheckM - qa] Tabulating genome statistics.

Calculating AAI between multi-copy marker genes.

Reading HMM info from file. Parsing HMM hits to marker genes: Finished parsing hits for 1 of 1 (100.00%) bins.

Bin Id                            Marker lineage            # genomes  # markers  # marker sets  0  1    2  3  4  5+  Completeness  Contamination  Strain heterogeneity
Acidilobus_saccharovorans_345_15  c__Thermoprotei (UID148)  41         245        158            1  244  0  0  0  0   99.37         0.00           0.00

donovan-h-parks commented 8 years ago

Hello,

CheckM doesn't seem to like the file "Aciduliprofundum boonei T469". This seems to be causing some steps to be truncated after the first genome.

If you send me the FASTA file for this genome, I can check why it is causing problems.

Cheers, Donovan

UCSBMicrobiologyCore commented 8 years ago

thanks! Is there an address you would prefer?

donovan-h-parks commented 8 years ago

donovan.parks [at] gmail.com

UCSBMicrobiologyCore commented 8 years ago

Hi all;

This is just for posterity: Dr Parks worked with me privately, and we determined that the issue was spaces in my file names. I have replaced them with underscores, and things seem to be running smoothly.
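In case it helps anyone who lands here with the same error, the renaming can be scripted. This is a rough sketch rather than anything from CheckM: the function name, the dry_run option, and the example file names are all hypothetical, and it only touches files directly inside the given directory.

```python
import tempfile
from pathlib import Path

def replace_spaces(bin_dir, dry_run=True):
    """Rename files in bin_dir so spaces become underscores.

    Returns a list of (old_name, new_name) pairs. With dry_run=True nothing
    is renamed, so the planned changes can be reviewed first.
    """
    renamed = []
    for path in sorted(Path(bin_dir).iterdir()):
        if path.is_file() and " " in path.name:
            target = path.with_name(path.name.replace(" ", "_"))
            if target.exists():
                # Never clobber an existing file that already has the new name.
                raise FileExistsError(f"refusing to overwrite {target}")
            if not dry_run:
                path.rename(target)
            renamed.append((path.name, target.name))
    return renamed

# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "Aciduliprofundum boonei T469.fasta").touch()
    (Path(demo_dir) / "Bathyarchaeota_SMTZ1-55.fasta").touch()
    for old, new in replace_spaces(demo_dir, dry_run=False):
        print(f"{old} -> {new}")
```

Running it once with dry_run=True on the real bin directory shows what would change before any file is touched.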

thank you!

Ecogenomics / CheckM

Trouble with checkm tree and understanding output #86