bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

something went wrong with making summary abundance table. Exiting... #184

Open itiago opened 5 years ago

itiago commented 5 years ago

Hi, I keep getting this error over and over again. Thank you for any help. Best

(metawrap-env) deepbio@deepbio-virtual-machine:~/Documents/AC3_Paper$ metawrap quant_bins -b Coverage_EnergeticaPathwaysGenes/ -o epathCoverageResults/CoverageMetaWrap_output_unbinned -a epathCoverageResults/AC3_bulkcontigs_renamed.fas BamFiles/AC3_2MetaGenMerged_R1_val_1.fastq BamFiles/AC3_2MetaGenMerged_R1_val_2.fastq
metawrap quant_bins -b Coverage_EnergeticaPathwaysGenes/ -o epathCoverageResults/CoverageMetaWrap_output_unbinned -a epathCoverageResults/AC3_bulkcontigs_renamed.fas BamFiles/AC3_2MetaGenMerged_R1_val_1.fastq BamFiles/AC3_2MetaGenMerged_R1_val_2.fastq

------------------------------------------------------------------------------------------------------------------------
-----                                  1 forward and 1 reverse read files detected                                 -----
------------------------------------------------------------------------------------------------------------------------

########################################################################################################################
#####                                    SETTING UP OUTPUT AND INDEXING ASSEMBLY                                   #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                            Indexing assembly file with salmon. Ignore any warnings                           -----
------------------------------------------------------------------------------------------------------------------------

Version Info: This is the most recent version of salmon.
index ["epathCoverageResults/CoverageMetaWrap_output_unbinned/assembly_index"] did not previously exist  . . . creating it
[2019-05-24 21:48:09.284] [jLog] [info] building index
[2019-05-24 21:48:09.285] [jointLog] [info] [Step 1 of 4] : counting k-mers
counted k-mers for 70,000 transcriptsElapsed time: 4.59077s

[2019-05-24 21:48:13.876] [jointLog] [info] Replaced 0 non-ATCG nucleotides
[2019-05-24 21:48:13.876] [jointLog] [info] Clipped poly-A tails from 1 transcripts
[2019-05-24 21:48:13.883] [jointLog] [info] Building rank-select dictionary and saving to disk
[2019-05-24 21:48:13.900] [jointLog] [info] done
Elapsed time: 0.0161781s
[2019-05-24 21:48:13.900] [jointLog] [info] Writing sequence data to file . . .
[2019-05-24 21:48:14.061] [jointLog] [info] done
Elapsed time: 0.161739s
[2019-05-24 21:48:14.068] [jointLog] [info] Building 32-bit suffix array (length of generalized text is 116,432,793)
[2019-05-24 21:48:14.470] [jointLog] [info] Building suffix array . . .
success
saving to disk . . . done
Elapsed time: 0.546009s
done
Elapsed time: 29.6191s
processed 116,000,000 positions[2019-05-24 21:50:42.555] [jointLog] [info] khash had 112,019,499 keys
[2019-05-24 21:50:42.555] [jointLog] [info] saving hash to disk . . .
[2019-05-24 21:50:56.416] [jointLog] [info] done
Elapsed time: 13.8615s
[2019-05-24 21:50:58.655] [jLog] [info] done building index

########################################################################################################################
#####                           ALIGNING READS FROM ALL SAMPLES BACK TO BINS WITH SALMON                           #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                            processing sample AC3_2MetaGenMerged_R1_val with reads                            -----
-----                                BamFiles/AC3_2MetaGenMerged_R1_val_1.fastq and                                -----
-----                                BamFiles/AC3_2MetaGenMerged_R1_val_2.fastq...                                 -----
------------------------------------------------------------------------------------------------------------------------

Version Info: This is the most recent version of salmon.
### salmon (mapping-based) v0.13.1
### [ program ] => salmon
### [ command ] => quant
### [ index ] => { epathCoverageResults/CoverageMetaWrap_output_unbinned/assembly_index }
### [ libType ] => { IU }
### [ mates1 ] => { BamFiles/AC3_2MetaGenMerged_R1_val_1.fastq }
### [ mates2 ] => { BamFiles/AC3_2MetaGenMerged_R1_val_2.fastq }
### [ output ] => { epathCoverageResults/CoverageMetaWrap_output_unbinned/alignment_files/AC3_2MetaGenMerged_R1_val.quant }
### [ meta ] => { }
### [ threads ] => { 1 }
Logs will be written to epathCoverageResults/CoverageMetaWrap_output_unbinned/alignment_files/AC3_2MetaGenMerged_R1_val.quant/logs
[2019-05-24 21:50:59.151] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2019-05-24 21:50:59.151] [jointLog] [warning]

NOTE: It appears you are running salmon without the `--validateMappings` option.
Mapping validation can generally improve both the sensitivity and specificity of mapping,
with only a moderate increase in use of computational resources.
Mapping validation is planned to become a default option (i.e. turned on by default) in
the next release of salmon.
Unless there is a specific reason to do this (e.g. testing on clean simulated data),
`--validateMappings` is generally recommended.

[2019-05-24 21:50:59.151] [jointLog] [info] parsing read library format
[2019-05-24 21:50:59.151] [jointLog] [info] There is 1 library.
[2019-05-24 21:50:59.289] [stderrLog] [info] Loading Suffix Array
[2019-05-24 21:50:59.288] [jointLog] [info] Loading Quasi index
[2019-05-24 21:50:59.288] [jointLog] [info] Loading 32-bit quasi index
[2019-05-24 21:50:59.773] [stderrLog] [info] Loading Transcript Info
[2019-05-24 21:50:59.913] [stderrLog] [info] Loading Rank-Select Bit Array
[2019-05-24 21:50:59.924] [stderrLog] [info] There were 78,734 set bits in the bit array
[2019-05-24 21:50:59.934] [stderrLog] [info] Computing transcript lengths
[2019-05-24 21:50:59.934] [stderrLog] [info] Waiting to finish loading hash

[2019-05-24 21:51:08.267] [jointLog] [info] done
[2019-05-24 21:51:08.267] [jointLog] [info] Index contained 78,734 targets
processed 23,500,000 fragmentserrLog] [info] Done loading index
hits: 8,553,118, hits per frag:  0.364048

[2019-05-24 22:21:12.927] [jointLog] [info] Computed 105,456 rich equivalence classes for further processing
[2019-05-24 22:21:12.927] [jointLog] [info] Counted 8,425,791 total reads in the equivalence classes
[2019-05-24 22:21:12.930] [jointLog] [warning] 0.000125587% of fragments were shorter than the k used to build the index (31).
If this fraction is too large, consider re-building the index with a smaller k.
The minimum read size found was 22.

[2019-05-24 22:21:12.930] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings : 497
[2019-05-24 22:21:12.930] [jointLog] [info] Mapping rate = 35.2722%

[2019-05-24 22:21:12.930] [jointLog] [info] finished quantifyLibrary()
[2019-05-24 22:21:12.930] [jointLog] [info] Starting optimizer
[2019-05-24 22:21:12.992] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2019-05-24 22:21:12.996] [jointLog] [info] iteration = 0 | max rel diff. = 159.415
[2019-05-24 22:21:13.348] [jointLog] [info] iteration = 100 | max rel diff. = 3.01882e-15
[2019-05-24 22:21:13.350] [jointLog] [info] Finished optimizer
[2019-05-24 22:21:13.350] [jointLog] [info] writing output

[2019-05-24 22:21:13.524] [jointLog] [warning] NOTE: Read Lib [[ BamFiles/AC3_2MetaGenMerged_R1_val_1.fastq, BamFiles/AC3_2MetaGenMerged_R1_val_2.fastq]] :

Detected a *potential* strand bias > 1% in an unstranded protocol check the file: epathCoverageResults/CoverageMetaWrap_output_unbinned/alignment_files/AC3_2MetaGenMerged_R1_val.quant/lib_format_counts.json for details

------------------------------------------------------------------------------------------------------------------------
-----                                           summarize salmon files...                                          -----
------------------------------------------------------------------------------------------------------------------------

Starting in: /home/deepbio/Documents/AC3_Paper/epathCoverageResults/CoverageMetaWrap_output_unbinned/alignment_files
Loading counts from: ./AC3_2MetaGenMerged_R1_val.quant quant.sf
"AC3_2MetaGenMerged_R1_val.quant.counts"

########################################################################################################################
#####                                   EXTRACTING AVERAGE ABUNDANCE OF EACH BIN                                   #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                            There were 1 samples detected. Making abundance table!                            -----
------------------------------------------------------------------------------------------------------------------------

************************************************************************************************************************
*****                     something went wrong with making summary abundance table. Exiting...                     *****
************************************************************************************************************************

real    33m8,304s
user    34m43,593s
sys     0m58,643s
(metawrap-env) deepbio@deepbio-virtual-machine:~/Documents/AC3_Paper$
itiago commented 5 years ago

Hi Gherman, did you have time to look at this error? Am I doing something wrong? Thank you for any input. Best

ursky commented 5 years ago

Are you sure that the contigs in the Coverage_EnergeticaPathwaysGenes/ bins have the exact same names as those in epathCoverageResults/AC3_bulkcontigs_renamed.fas? Try running without providing the -a option.
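One way to check this before re-running is to compare the FASTA headers directly. The sketch below is not part of metaWRAP; the function names, the list of bin file extensions, and the choice to treat the first whitespace-delimited word of each header as the contig name are all assumptions made for illustration.

```python
import os

def fasta_ids(path):
    """Return the set of sequence IDs (first word after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids

def missing_from_assembly(bins_dir, assembly_path,
                          extensions=(".fa", ".fasta", ".fas")):
    """Map each bin file name to the contig IDs absent from the assembly.

    The extensions tuple is an assumption; adjust it to match your bin files.
    """
    assembly_ids = fasta_ids(assembly_path)
    report = {}
    for name in sorted(os.listdir(bins_dir)):
        if name.endswith(extensions):
            bin_ids = fasta_ids(os.path.join(bins_dir, name))
            report[name] = bin_ids - assembly_ids
    return report
```

Calling `missing_from_assembly("Coverage_EnergeticaPathwaysGenes/", "epathCoverageResults/AC3_bulkcontigs_renamed.fas")` would return an empty set for every bin whose contig names all appear in the assembly; any non-empty set lists the names that do not match.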

itiago commented 5 years ago

Yes, I am sure; they came from the same file. I've tried that: without the -a option the error is the same.

ursky commented 5 years ago

Can you provide the AC3_2MetaGenMerged_R1_val.quant.counts file?

itiago commented 5 years ago

I didn't know which was which, so I zipped the folder.

-- Igor Tiago Researcher Universidade de Coimbra

Laboratório Microbiologia Edificio Patronato Rua da Matemática Nº 49 3000-276 Coimbra, Portugal

ursky commented 5 years ago

I don't see an attachment...

itiago commented 5 years ago

I sent it by email; maybe that is why it didn't go through.

AC3_2MetaGenMerged_R1_val.quant.zip

ursky commented 5 years ago

Your output should have a quant_files folder. Do you have this? I will also need the bins folder.

itiago commented 5 years ago

Complete output. Thank you for the help! CoverageMetaWrap_output_unbinned.zip

ursky commented 5 years ago

If you look at the output file, it reads:

None of the contigs/scaffolds in the -a metagenomic assembly file were present in the bin files. Please make sure that the bins and total assembly have the exact same bins. One cause for this could be that you reassembled the bins, disrupring the contig naming. If you do not have the original total metagenomic assembly file, then you could not provide the -a option at all (but this is not ideal for abundance estimation).

This means 2 things:

  1. My program is printing this warning to stdout instead of stderr - I just fixed this for future users
  2. Your bins likely have a different contig naming convention. Can you post your bin folder?
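
For readers unfamiliar with the stdout/stderr distinction in point 1, here is a minimal Python sketch of routing a warning to stderr. It is not the actual metaWRAP code; the `warn` helper and its `WARNING:` prefix are hypothetical.

```python
import sys

def warn(message):
    """Write a warning to stderr so it is not mixed into data sent to stdout."""
    print("WARNING: " + message, file=sys.stderr)
```

With this pattern, redirecting stdout into a results file (`script.py > table.tsv`) keeps the warning visible on the terminal instead of burying it inside the output file.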
itiago commented 5 years ago

But why does it happen when I run it without the -a assemble.fas? Can I send them privately?

ursky commented 5 years ago

That is a good question, but I will not know until I see them. And sure, feel free to email me. Maybe just a few problematic ones so I can see what's wrong.

ursky commented 5 years ago

The bin (protein) files you gave me have a different naming convention than the contig names in the metawrap output you gave me. For example, k121_28102_ vs k121_4_flag_1_multi_5.0000_len_622. The fact that the metawrap output has the full contig name means that you probably gave it the -a option. It's possible that re-running the module on the same output did not overwrite your earlier attempts. I don't think there is an error here; you just need to re-run the program. Make sure you delete the old metawrap output, and do not provide the -a option.

Two more things. First, be careful when working with gene names in fasta format to make sure every identifier is unique. I noticed you simply truncated the contig name, but what happens when there is more than one gene of interest on one contig? I use names like k121_4_flag_1_multi_5.0000_len_622-1 and k121_4_flag_1_multi_5.0000_len_622-2 - that way I have the gene ID and full contig name. The second is that if you are estimating the gene abundance in DNA data, you will actually get more robust estimates by using the coverage of the whole contig to approximate the gene coverage. This way things like GC biases and random read drop-in and drop-out won't affect it. The core assumption here is that the CPM (counts per million) abundance of a gene is identical to that of the contig that carries it, which is true for metagenomic (DNA) data.
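
As a rough illustration of both points, the sketch below numbers genes per contig using the `-1`, `-2` suffix style mentioned above, then looks up a gene's abundance from its contig. The function names and the `contig_abundance` dictionary are hypothetical; this is not how metaWRAP itself stores or reports abundances.

```python
from collections import defaultdict

def unique_gene_ids(gene_contigs):
    """Given the contig name of each gene (in order), return unique gene IDs
    of the form '<contig>-1', '<contig>-2', ..."""
    seen = defaultdict(int)
    ids = []
    for contig in gene_contigs:
        seen[contig] += 1
        ids.append("{}-{}".format(contig, seen[contig]))
    return ids

def gene_abundance(gene_id, contig_abundance):
    """Approximate a gene's abundance by that of the contig carrying it.

    Strips the trailing '-<n>' suffix to recover the full contig name.
    """
    contig = gene_id.rsplit("-", 1)[0]
    return contig_abundance[contig]
```

Because the suffix is split off from the right, contig names that themselves contain hyphens would still resolve correctly, as long as every gene ID ends in `-<number>`.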

itiago commented 5 years ago

Gherman, sorry for not describing the files. The files I sent already take all of that into account. I renamed all contigs to k121xxxx, then used the gene references to retrieve the contigs the genes belong to, so each gene file in fact contains the contigs the genes come from, and those contigs are named k121xxx. Even if there are two genes on one contig (say, the RuBisCO small and large subunits), the file will contain that contig only once. My question is whether there is a problem when the same contig is in different files. I assume not, since there is a way to know whether contigs are shared among files (bins). Given all this, I don't understand why it is not working...
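
Whether shared contigs cause a problem is not settled in this thread, but finding them is straightforward. A minimal sketch, assuming the contig IDs of each file have already been read into a dictionary (the `shared_contigs` name and input shape are hypothetical):

```python
from collections import defaultdict

def shared_contigs(bins):
    """bins: dict mapping bin file name -> iterable of contig IDs.

    Return {contig: sorted list of bin names} for every contig that
    appears in more than one bin.
    """
    owners = defaultdict(set)
    for bin_name, contigs in bins.items():
        for contig in contigs:
            owners[contig].add(bin_name)
    return {c: sorted(b) for c, b in owners.items() if len(b) > 1}
```

An empty result means no contig is present in more than one file.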
