MultiQC / MultiQC

Aggregate results from bioinformatics analyses across many samples into a single report.
http://multiqc.info
GNU General Public License v3.0

Exporting general stats table - missingness #1522

Open xavierrocarada opened 3 years ago

xavierrocarada commented 3 years ago

Description of bug

Attachments: general_stats_table.csv, PT_aDNA_1_multiqc_report.html.zip

I have attached the HTML report and the general stats table exported in comma-separated format. All the samples are in the CSV file, but some of them have no value in some columns, and I know that these values have been calculated because I have seen them in the beeswarm plot. For instance, sample 22219 has no value in the Endogenous DNA (%) column, yet the value is plotted in the beeswarm. Why are data missing from the exported file?

File that triggers the error

No response

MultiQC Error log

No response

jfy133 commented 2 years ago

@xavierrocarada can you send the entire MultiQC results folder? And also, if you still have it, the .nextflow.log of that run?

xavierrocarada commented 2 years ago

Attachment: multiqc.zip

@jfy133 Thanks for having a look at it! :) I have attached the MultiQC results folder, but I am sorry to say that I do not have the .nextflow.log of that run... :(

jfy133 commented 2 years ago

Ok, so the information IS in multiqc_data.json

            "22219": {
                "endogenous_dna": 7.429926,
                "endogenous_dna_post": 5.54621
            },

However, when I search for those two values in the exported CSV/TSV files, they are associated with a different sample: 22204_S57_L001_R2_001.

Sample % Duplicate Reads Average % GC Content Average Sequence Length (bp) Percentage of modules failed in FastQC report (includes those not plotted here) Total Sequences () Duplication rate before filtering Percentage of reads > Q30 after filtering Bases > Q30 after filtering (millions) GC content after filtering Percent reads passing filter % trimmed reads Total trimmed reads () % Duplicate Reads Average % GC Content Average Sequence Length (bp) Percentage of modules failed in FastQC report (includes those not plotted here) Total Sequences () Total reads in the bam file () Reads Mapped in the bam file () Total reads in the bam file () Reads Mapped in the bam file () Percentage of reads categorised as a technical duplicate CF~1 means high library complexity. Large CF means not worth sequencing deeper. Non-unique reads removed after deduplication () Unique mapping reads after deduplication () 3 Prime 1st base substitution frequency for G>A 3 Prime 2nd base substitution frequency for G>A 5 Prime 1st base substitution frequency for C>T 5 Prime 2nd base substitution frequency for C>T Read length std. dev. Median read length Mean read length Average coverage (X) on mitochondrial genome. Average coverage (X) on nuclear genome. Mitochondrial to nuclear reads ratio (MTNUC) Reads on the nuclear genome () Reads on the mitochondrial genome () Mean GC content Fraction of genome with at least 1X coverage Fraction of genome with at least 2X coverage Fraction of genome with at least 3X coverage Fraction of genome with at least 4X coverage Fraction of genome with at least 5X coverage Median coverage Mean coverage % mapped reads Number of mapped reads () Number of reads () Alignment error rate. Total edit distance (SAM NM field) over the number of mapped bases Rate of Error for Chr X Rate of Error for Chr Y Number of positions on Chromosome X vs Autosomal positions. Number of positions on Chromosome Y vs Autosomal positions. 
#SNPs Covered #SNPs Total Endogenous DNA (%) Endogenous DNA Post (%) Number of SNPs Contamination Estimate (Method1_MOM) Estimate Error (Method1_MOM) Contamination Estimate (Method1_ML) Estimate Error (Method1_ML) Contamination Estimate (Method2_MOM) Estimate Error (Method2_MOM) Contamination Estimate (Method2_ML) Estimate Error (Method2_ML)
22204_S57_L001_R2_001 6.93469429562835 49 101 18.1818181818182 8428752 7.31127 92.6158 1395.931856 55.7652 99.885818341311 98.3596809666944 18904052 1.52336861404443 57 46.8097996027694 9.09090909090909 1229009 13249591 8502941 24595 24595 9 1.09 152467 1497370 12.4848190429925 2.03599881971083 11.587147030185 1.98637911464245 18.4223847868222 43 47.9921496473478 0.028788701792504 0.000235751625664 122.11 15783 9 46.6830699774266 0.221776984778939 0.000756186263807 0.000108272486329 5.0869263521002E-05 2.97056100260488E-05 0 0.0022 100 140126 140126 0.45 0.034336204644391 0.057566950886416 0.682406704462112 0.181985732122619 88628 53227092 7.429926 5.54621 1 0 N/A 0 0 0 N/A 0 0

BUT, when I look in the JSON that you can export, the values look correct:

> library(jsonlite)
> res <- read_json("~/Downloads/general_stats_table.json")
> res$categories ## here I scrolled to find the endorspy pre/post columns, which start at element 56
> res$samples[[56]][[51]] ## look through the sample names for column 56; 22219 is entry 51
[1] "22219"
> res$datasets[[56]][[51]]
[1] 7.429926
> res$datasets[[57]][[51]]
[1] 5.54621

So indeed, I think there is something funky going on in the generalstats export table... will need to wait for Phil unfortunately :\
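The check described above (searching the exported CSV/TSV for the two values to see which sample they are attached to) can be sketched in Python. This is only an illustration: the function name `rows_containing` is hypothetical, the example table is a made-up three-line stand-in for the real export, and it assumes the sample name is the first column and that values are compared as exact strings.

```python
import csv
import io


def rows_containing(text, values, delimiter="\t"):
    """Return the first-column label of every row that contains all `values`.

    Values are compared as exact strings, so "7.429926" matches only a cell
    that holds exactly that text.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    wanted = {str(v) for v in values}
    hits = []
    for row in reader:
        if row and wanted.issubset(set(row)):
            hits.append(row[0])
    return hits


# Made-up miniature export (NOT the real file) that mimics the reported symptom:
# the values for 22219 end up on the row of another sample.
example = (
    "Sample\tEndogenous DNA (%)\tEndogenous DNA Post (%)\n"
    "22204_S57_L001_R2_001\t7.429926\t5.54621\n"
    "22219\t\t\n"
)

print(rows_containing(example, ["7.429926", "5.54621"]))
```

For a comma-separated export, pass `delimiter=","`.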

xavierrocarada commented 2 years ago

James, thank you very much for having a look at this! Is there a way to get the general stats from the multiqc_data.json file? Or is it better to wait for Phil to have a look at it?

jfy133 commented 2 years ago

If you're familiar with R and JSON files you can reconstruct it (my R example above gives you the general gist).
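For anyone who prefers Python, the same reconstruction can be sketched as follows. It assumes the exported general_stats_table.json has the layout implied by the R session above: three parallel lists, `categories`, `samples` and `datasets`, where `samples[i]` and `datasets[i]` hold the sample names and values for column `i`. Treating each `categories` entry as a plain column label is a guess, and `table_by_sample` is a hypothetical helper, not part of MultiQC. The key point is that values are joined by sample name, not by row position.

```python
import json


def table_by_sample(export):
    """Build {sample: {column_label: value}} from the parallel lists."""
    table = {}
    columns = zip(export["categories"], export["samples"], export["datasets"])
    for label, names, values in columns:
        # Each column carries its own list of sample names, so pair every
        # value with the name stored alongside it.
        for name, value in zip(names, values):
            table.setdefault(name, {})[str(label)] = value
    return table


# Made-up miniature export; "sample_A" and its value are placeholders.
export = json.loads("""
{
  "categories": ["Endogenous DNA (%)", "Endogenous DNA Post (%)"],
  "samples": [["sample_A", "22219"], ["22219"]],
  "datasets": [[1.0, 7.429926], [5.54621]]
}
""")

table = table_by_sample(export)
print(table["22219"])
```

Samples that lack a value for a column simply have no entry for it, which makes genuinely missing data easy to tell apart from misplaced data.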

g-pacheco commented 2 years ago

Hi @xavierrocarada, could you please let me know how you created the .csv file? Is there an option in MultiQC? I have also been trying to do that. Thanks!

g-pacheco commented 2 years ago

@ewels & @jfy133 maybe you would also know how to export this .csv file?

jfy133 commented 2 years ago

The toolbox on the right side of the MultiQC report should give you an option somewhere to export the data.

fgvieira commented 2 years ago

I could not find anything in the toolbox, only the option to download plot data. The only way I could find was to copy the general stats table (copy button) and paste it into an empty document, but that is not very practical...

jfy133 commented 2 years ago

Oops, sorry, I should have double-checked and not replied from my phone. You're right, I guess there is only the "copy general stats table" button (or just use the file multiqc_data/mqc_general_stats.txt, if it exists).

xavierrocarada commented 2 years ago

Sorry for the late reply. If you use the toolbox on the right side, there is an export tab. At the top there are two tabs: Images and Data. If you press "Data" you can download the general_stats_table in three different formats: tab-separated, comma-separated and JSON. If you select comma-separated, you'll get a .csv file. Yes, you can just copy the general stats table if you have it, but you can't do that if you have a beeswarm plot, and in that case some data will be missing from the exported file... This bug has not been fixed yet...

jfy133 commented 2 years ago

@xavierrocarada I also thought that but I don't see that option in 3 different examples I've looked at today (unless I'm being blind):

[Screenshot: export options in the toolbox]

There are all the plots but not general stats

xavierrocarada commented 2 years ago

True, you do not have this option if you already have a table that you can copy. However, if you have a beeswarm plot, you have this option:

[Screenshot: Screen Shot 2021-11-02 at 7 06 46 pm]

g-pacheco commented 2 years ago

Hello!

Thanks very much for all the help. We have found the multiqc_data/mqc_general_stats.txt file and it is just what we were looking for!

Best regards, George.
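As a rough illustration of working with that file, the sketch below lists which cells are empty for each sample. It assumes the file is tab-separated with a header row and the sample name in the first column, and that missing values appear as empty cells; `missing_cells` is a hypothetical helper and the example table is made up.

```python
import csv
import io


def missing_cells(text):
    """Map each sample to the list of columns whose cell is empty."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    sample_col = reader.fieldnames[0]  # assumed to hold the sample name
    missing = {}
    for row in reader:
        empty = [
            col
            for col, val in row.items()
            # col is None for surplus cells; val is None for short rows
            if col is not None and col != sample_col and not (val or "").strip()
        ]
        if empty:
            missing[row[sample_col]] = empty
    return missing


# Made-up miniature table; column and sample names are placeholders.
example = "Sample\tcol_a\tcol_b\nS1\t1.0\t2.0\nS2\t\t3.0\n"
print(missing_cells(example))
```

To run it on a real report, read the file contents first, for example with `pathlib.Path("multiqc_data/mqc_general_stats.txt").read_text()`, using whatever file name your MultiQC output folder actually contains.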

xavierrocarada commented 2 years ago

Hi @ewels,

Has there been any progress on this issue? I have a whole new dataset and the problem is still happening.

Cheers, Xavier

ewels commented 2 years ago

Not yet, sorry. @ErikDanielsson, maybe this is something that you could take a look at?

ErikDanielsson commented 2 years ago

Sure!

ewels commented 2 years ago

Hi all,

Coming back to this now. Does anyone have some input data to reproduce this error? It's great that we have the MultiQC outputs attached, but ideally I want to be able to run MultiQC myself and observe the error. Then I can work on fixing it.

The example report seems to have been generated with MultiQC v1.10.1, and a lot has changed since then.

Phil