Exporting general stats table - missingness

xavierrocarada commented 3 years ago

Description of bug

general_stats_table.csv PT_aDNA_1_multiqc_report.html.zip

I have attached the HTML report and the exported general stats table in comma-separated format. All the samples are in the CSV file, but some of them have no value in some columns, even though I know these values were calculated because I have seen them in the beeswarm plot. For instance, sample 22219 has nothing in the Endogenous DNA (%) column, yet the value is plotted in the beeswarm. Why are data missing from the exported file?

File that triggers the error

No response

MultiQC Error log

No response

jfy133 commented 2 years ago

@xavierrocarada can you send the entire MultiQC results folder? And, if you still have it, the .nextflow.log of that run?

xavierrocarada commented 2 years ago

multiqc.zip @jfy133 Thanks for having a look at it! :) I have attached the multiqc results folder, but I am sorry to say that I do not have the .nextflow.log of that run... :(

jfy133 commented 2 years ago

Ok, so the information IS in multiqc_data.json

            "22219": {
                "endogenous_dna": 7.429926,
                "endogenous_dna_post": 5.54621
            },

However, when I search for those two values in the exported CSV/TSV files, those values are associated with: 22204_S57_L001_R2_001.

Sample	% Duplicate Reads	Average % GC Content	Average Sequence Length (bp)	Percentage of modules failed in FastQC report (includes those not plotted here)	Total Sequences ()	Duplication rate before filtering	Percentage of reads > Q30 after filtering	Bases > Q30 after filtering (millions)	GC content after filtering	Percent reads passing filter	% trimmed reads	Total trimmed reads ()	% Duplicate Reads	Average % GC Content	Average Sequence Length (bp)	Percentage of modules failed in FastQC report (includes those not plotted here)	Total Sequences ()	Total reads in the bam file ()	Reads Mapped in the bam file ()	Total reads in the bam file ()	Reads Mapped in the bam file ()	Percentage of reads categorised as a technical duplicate	CF~1 means high library complexity. Large CF means not worth sequencing deeper.	Non-unique reads removed after deduplication ()	Unique mapping reads after deduplication ()	3 Prime 1st base substitution frequency for G>A	3 Prime 2nd base substitution frequency for G>A	5 Prime 1st base substitution frequency for C>T	5 Prime 2nd base substitution frequency for C>T	Read length std. dev.	Median read length	Mean read length	Average coverage (X) on mitochondrial genome.	Average coverage (X) on nuclear genome.	Mitochondrial to nuclear reads ratio (MTNUC)	Reads on the nuclear genome ()	Reads on the mitochondrial genome ()	Mean GC content	Fraction of genome with at least 1X coverage	Fraction of genome with at least 2X coverage	Fraction of genome with at least 3X coverage	Fraction of genome with at least 4X coverage	Fraction of genome with at least 5X coverage	Median coverage	Mean coverage	% mapped reads	Number of mapped reads ()	Number of reads ()	Alignment error rate. Total edit distance (SAM NM field) over the number of mapped bases	Rate of Error for Chr X	Rate of Error for Chr Y	Number of positions on Chromosome X vs Autosomal positions.	Number of positions on Chromosome Y vs Autosomal positions.	
#SNPs Covered	#SNPs Total	Endogenous DNA (%)	Endogenous DNA Post (%)	Number of SNPs	Contamination Estimate (Method1_MOM)	Estimate Error (Method1_MOM)	Contamination Estimate (Method1_ML)	Estimate Error (Method1_ML)	Contamination Estimate (Method2_MOM)	Estimate Error (Method2_MOM)	Contamination Estimate (Method2_ML)	Estimate Error (Method2_ML)
22204_S57_L001_R2_001	6.93469429562835	49	101	18.1818181818182	8428752	7.31127	92.6158	1395.931856	55.7652	99.885818341311	98.3596809666944	18904052	1.52336861404443	57	46.8097996027694	9.09090909090909	1229009	13249591	8502941	24595	24595	9	1.09	152467	1497370	12.4848190429925	2.03599881971083	11.587147030185	1.98637911464245	18.4223847868222	43	47.9921496473478	0.028788701792504	0.000235751625664	122.11	15783	9	46.6830699774266	0.221776984778939	0.000756186263807	0.000108272486329	5.0869263521002E-05	2.97056100260488E-05	0	0.0022	100	140126	140126	0.45	0.034336204644391	0.057566950886416	0.682406704462112	0.181985732122619	88628	53227092	7.429926	5.54621	1	0	N/A	0	0	0	N/A	0	0
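
For anyone wanting to repeat this kind of search on their own export, here is a minimal Python sketch (not part of MultiQC; `find_value` and the small `example` table are made up for illustration). It looks cells up by position rather than by column name, because the exported header repeats some names (for example "% Duplicate Reads" appears twice), and it reports which sample row a given number ended up in.

```python
import csv
import io
import math


def find_value(tsv_text, target, rel_tol=1e-9):
    """Return (sample, column_index, column_name) for every cell equal to target."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    hits = []
    for row in body:
        if not row:
            continue  # ignore blank lines
        for idx, cell in enumerate(row[1:], start=1):
            try:
                value = float(cell)
            except ValueError:
                continue  # skip empty cells and text such as "N/A"
            if math.isclose(value, target, rel_tol=rel_tol):
                name = header[idx] if idx < len(header) else ""
                hits.append((row[0], idx, name))
    return hits


# Hypothetical, heavily shortened stand-in for the exported table.
example = (
    "Sample\tEndogenous DNA (%)\tEndogenous DNA Post (%)\n"
    "22204_S57_L001_R2_001\t7.429926\t5.54621\n"
    "22219\t\t\n"
)
print(find_value(example, 7.429926))
```

With a real export you would pass the file contents instead of `example`, e.g. `find_value(open("general_stats_table.tsv").read(), 7.429926)`.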

But when I look at the JSON that you can export, the values look correct:

> library(jsonlite)
> res <- read_json("~/Downloads/general_stats_table.json")
> res$categories ## here I scrolled to find the endorspy pre/post columns, which are under element '56'
> res$samples[[56]][[51]] ## find the sample names in column 56, I identify 22219 in 51
[1] "22219"
> res$datasets[[56]][[51]]
[1] 7.429926
> res$datasets[[57]][[51]]
[1] 5.54621

So indeed, I think there is something funky going on in the generalstats export table... will need to wait for Phil unfortunately :\

xavierrocarada commented 2 years ago

James, thank you very much for having a look at this! Is there a way to get the general stats from the multiqc_data.json file? Or is it better to wait for Phil to have a look at it?

jfy133 commented 2 years ago

If you're familiar with R and JSON files you can reconstruct it (my R example above basically gives you the general gist).
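
If Python is more convenient, the same reconstruction can be sketched roughly as below. This is only a sketch under assumptions: it assumes multiqc_data.json keeps the general stats under a top-level `report_general_stats_data` key as a list of per-module blocks, each mapping sample name to metrics (the shape of the "22219" snippet above), and this may differ between MultiQC versions. The helper names `general_stats_rows` and `to_csv` are made up for illustration.

```python
import csv
import io
import json


def general_stats_rows(report):
    """Merge the per-module general stats blocks into one row per sample."""
    # ASSUMPTION: top-level key name and list-of-blocks layout; check your own file.
    blocks = report.get("report_general_stats_data", [])
    if isinstance(blocks, dict):
        blocks = list(blocks.values())
    rows = {}
    for i, block in enumerate(blocks):
        for sample, metrics in block.items():
            row = rows.setdefault(sample, {})
            for metric, value in metrics.items():
                # Prefix with the block index so a module that ran twice
                # (e.g. before and after trimming) does not overwrite itself.
                row[f"{i}:{metric}"] = value
    return rows


def to_csv(rows):
    """Write the merged rows as CSV text, leaving absent metrics empty."""
    columns = []
    for metrics in rows.values():
        for name in metrics:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sample"] + columns)
    for sample in sorted(rows):
        writer.writerow([sample] + [rows[sample].get(c, "") for c in columns])
    return out.getvalue()


# Stand-in built from the snippet quoted earlier in this thread.
report = json.loads(
    '{"report_general_stats_data": '
    '[{"22219": {"endogenous_dna": 7.429926, "endogenous_dna_post": 5.54621}}]}'
)
print(to_csv(general_stats_rows(report)))
```

For a real run you would load the file with `json.load(open("multiqc_data/multiqc_data.json"))` instead of the inline `report`. Column names come out as internal metric keys rather than the titles shown in the report.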

g-pacheco commented 2 years ago

Hej @xavierrocarada, could you please let me know how you created the .csv file? Is there an option in MultiQC? I have also been trying to do that. Thanks!

g-pacheco commented 2 years ago

@ewels & @jfy133 maybe you would also know how to export this .csv file?

jfy133 commented 2 years ago

The toolbox on the right side of the MultiQC report should have an option somewhere to export the data.

fgvieira commented 2 years ago

Could not find anything on the toolbox, only to download plot data. The only way I could find was to copy the general stats table (copy button) and paste on an empty document, but that is not very practical...

jfy133 commented 2 years ago

Oops sorry, I probably should have double checked and not replied from my phone, you're right I guess there is only the copy general stats table (or just use the file multiqc_data/mqc_general_stats.txt, if it exists)

xavierrocarada commented 2 years ago

Sorry for the late reply. If you use the toolbox on the right side, there is an export tab. On the top part there are two tabs: Images and Data. If you press "Data" you can download the general_stats_table in three different formats: tab-separated; comma-separated and JSON. If you select comma-separated, you'll get a .csv file. Yes, you can just copy the general stats table if you have it, but you can't do that if you have a beeswarm plot because some data will be missing in the exported file... This bug has not been fixed yet...
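
To see which samples are affected in a comma-separated export, a quick check along these lines may help. It is a hypothetical sketch, not a MultiQC feature: `empty_cells` and the `example` text are invented for illustration, and it simply lists, per sample, the columns whose cell is blank.

```python
import csv
import io


def empty_cells(csv_text):
    """Map each sample to the list of column names whose cell is empty."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    missing = {}
    for row in rows[1:]:
        if not row:
            continue  # ignore blank lines
        # Pad short rows so cells missing at the end of a line are counted too.
        padded = row + [""] * (len(header) - len(row))
        blanks = [header[i] for i in range(1, len(header)) if padded[i].strip() == ""]
        if blanks:
            missing[row[0]] = blanks
    return missing


# Hypothetical, shortened stand-in for general_stats_table.csv.
example = (
    "Sample,Endogenous DNA (%),Endogenous DNA Post (%)\n"
    "22204_S57_L001_R2_001,7.429926,5.54621\n"
    "22219,,\n"
)
print(empty_cells(example))
```

On a real file, pass `open("general_stats_table.csv").read()` in place of `example`.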

jfy133 commented 2 years ago

@xavierrocarada I also thought that but I don't see that option in 3 different examples I've looked at today (unless I'm being blind):

There are all the plots but not general stats

xavierrocarada commented 2 years ago

True, you do not have this option if you already have a table that you can copy. However, if you have a beeswarm plot, you have this option:

g-pacheco commented 2 years ago

Hello!

Thanks very much for all the help. We have found the multiqc_data/mqc_general_stats.txt file and it is just what we were looking for!

Best regards, George.

xavierrocarada commented 2 years ago

Hi @ewels,

Is there any progress on this issue? I have a whole new dataset and the problem is still happening.

Cheers, Xavier

ewels commented 2 years ago

Not yet, sorry. @ErikDanielsson, maybe this is something that you could take a look at?

ErikDanielsson commented 2 years ago

Sure!

ewels commented 2 years ago

Hi all,

Coming back to this now. Does anyone have some input data to reproduce this error? It's great that we have the MultiQC outputs attached, but ideally I want to be able to run MultiQC myself and observe the error. Then I can work on fixing it.

The example report seems to have been generated with MultiQC v1.10.1 and a lot has changed since then.

Phil
