Information about Supplementary_Table2_datasetsQC.xlsx

smanfri commented 1 year ago

Good morning,

I'm a Computer Science student at Università degli Studi di Milano, and for my thesis I am assessing several pipelines for the analysis of SARS-CoV-2 samples. To select the best pipeline for our requirements, I'm using the benchmark datasets available here. I found Supplementary_table2 in your paper (Xiaoli L, Hagey JV, Park DJ, Gulvik CA, Young EL, Alikhan N-F, Lawsin A, Hassell N, Knipe K, Oakeson KF, Retchless AC, Shakya M, Lo C-C, Chain P, Page AJ, Metcalf BJ, Su M, Rowell J, Vidyaprakash E, Paden CR, Huang AD, Roellig D, Patel K, Winglee K, Weigand MR, Katz LS. 2022. Benchmark datasets for SARS-CoV-2 surveillance bioinformatics. PeerJ 10:e13821 http://doi.org/10.7717/peerj.13821) and I would also like to use the data it contains for my evaluations (not only the file in.tsv available for every dataset). I'm writing here because I can't understand how the column 'Total reads' is calculated. In particular, I computed this value with FastQC (the 'Total Sequences' field) and I also counted the reads in the original FASTQ files, but the numbers don't match the ones published in Supplementary_table2.
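For reference, here is a minimal sketch of how the reads in a FASTQ file can be counted directly, so the result can be set against FastQC's 'Total Sequences' value. It assumes plain four-line FASTQ records (no wrapped sequence lines) and treats a `.gz` suffix as gzip compression. The function name `count_fastq_reads` is only illustrative and is not part of the benchmark repository or of FastQC.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file.

    Assumption: every record spans exactly four lines
    (header, sequence, '+', quality), i.e. sequences are not wrapped.
    """
    path = Path(path)
    # Assumption: files ending in .gz are gzip-compressed, all others are plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

It could be called once per file, for example `count_fastq_reads("sample_R1.fastq.gz")`, where the file name is a placeholder. For paired-end data each mate file is counted separately, so it is worth noting which files a reported total refers to when comparing numbers.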

Do you know why the numbers are different? Is it possible that Supplementary_table2 is outdated with respect to the current version of the dataset? If this is the case, which version of the dataset is matched to Supplementary_table2 and used in your paper?

Thank you very much for your time :)

Best regards, Sara Manfredi

lskatz commented 1 year ago

Hi, thank you for identifying this discrepancy. Although I can't promise to fix this right now, it would be helpful if you could post here some of the values you are finding in FastQC versus what you are seeing in the supplementary table. Thank you for your help.

smanfri commented 1 year ago

Hi, thank you for the response. In the attached file, I compare the total reads reported in Supplementary Table 2 with the values reported by FastQC (version 0.11.8). Note that:

- The file contains the results for the Illumina sequences of the “CoronaHiT-rapid” benchmark.
- Even when we take the values FastQC reports for the raw samples (before trimming), the values reported in Supplementary Table 2 are always larger.

Thank you for your attention, Sara

Total-reads_Supplementary-table2_VS_FastQC.xlsx

CDCgov / datasets-sars-cov-2

Information about Supplementary_Table2_datasetsQC.xlsx #34