Misplaced False Negative on Figure 3?

peterjc commented 1 year ago

From attempting to recreate the figure from the underlying data (and my own analysis of the FASTQ files), I think there is a misplaced False Negative annotation on Figure 3. I do not believe this change materially affects the paper, but is a point of confusion in attempting to replicate/reproduce the results.

Referring to Figure 3 as shown in the published manuscript https://doi.org/10.1038/s41598-021-85855-6 (Batovska et al. 2021), or here:

The right-most column (12S amplicon), lines 5 to 8 (Pool 2), are mostly pale green (R. padi). There are orange circled minus symbols indicating D. noxia false negatives on lines 6 and 8 (i.e. 250 Pool 2, and 1000 Pool 2).

I believe those should be on lines 7 and 8, i.e. 500 Pool 2 and 1000 Pool 2, show visually here with purple annotation:

figure3_annotated

Zooming in to the original PDF there does appear to be a thin orange line one line 6:

From within rstudio, having run enough of the provided R code to have the df_21s dataframe:

> df_12s[df_12s$Genus=="Diuraphis", c("Sample", "feature", "pool_comp", "Genus", "Abundance")]
         Sample   feature pool_comp     Genus    Abundance
26  Pool-02-100 Pool-0100       2.1 Diuraphis 0.0031758537
25  Pool-U2-250 Pool-0250       2.2 Diuraphis 0.0011390631
36  Pool-04-100 Pool-0100       4.1 Diuraphis 0.0011266682
32  Pool-U3-250 Pool-0250       3.2 Diuraphis 0.0005345783
39  Pool-U4-250 Pool-0250       4.2 Diuraphis 0.0004744621
21  Pool-05-100 Pool-0100       5.1 Diuraphis 0.0000000000
22  Pool-09-500 Pool-0500       4.3 Diuraphis 0.0000000000
23  Pool-01-100 Pool-0100       1.1 Diuraphis 0.0000000000
24 Pool-15-1000 Pool-1000       5.4 Diuraphis 0.0000000000
27 Pool-11-1000 Pool-1000       1.4 Diuraphis 0.0000000000
28  Pool-06-500 Pool-0500       1.3 Diuraphis 0.0000000000
29  Pool-10-500 Pool-0500       5.3 Diuraphis 0.0000000000
30 Pool-12-1000 Pool-1000       2.4 Diuraphis 0.0000000000
31  Pool-U1-250 Pool-0250       1.2 Diuraphis 0.0000000000
33  Pool-03-100 Pool-0100       3.1 Diuraphis 0.0000000000
34  Pool-08-500 Pool-0500       3.3 Diuraphis 0.0000000000
35  Pool-07-500 Pool-0500       2.3 Diuraphis 0.0000000000
37 Pool-13-1000 Pool-1000       3.4 Diuraphis 0.0000000000
38  Pool-U5-250 Pool-0250       5.2 Diuraphis 0.0000000000
40 Pool-14-1000 Pool-1000       4.4 Diuraphis 0.0000000000

My interpretation is those first two lines say Diuraphis noxia 12S was found at about 0.3% in 100-Pool-2 and about 0.1% in 250-Pool-2 (and also 100-Pool-4, 250-Pool-3 and 250-Pool-4 according to the next three lines). The rest are zero, i.e. 500-Pool-2 and 1000-Pool-2 are false negatives.

Possible evidence from output/csv/seperateloci/ earlier in the analysis:

$ cut -d "," -f 2,4,9,14,19,24 12S-Eukaryota_sppglom_filt_summarized.csv | grep -E "(Species|noxia)"
"Species","Pool-02-100","Pool-07-500","Pool-12-1000","Pool-C2-250","Pool-U2-250"
"Diuraphis noxia",0.00312643901020334,NA,NA,0.00116328324050249,0.0011341506181867

If I am interpreting that correctly, Diuraphis noxia was found at about 0.3% in 100-Pool-2, and 0.1% in 250-Pool-2 (both runs), but not in 500-Pool-2 nor 1000-Pool-2

Same with the genus CSV file:

$ cut -d "," -f 2,4,9,14,19,24 12S-Eukaryota_genglom_filt_summarized.csv | grep -E "(Genus|Diuraphis)"
"Genus","Pool-02-100","Pool-07-500","Pool-12-1000","Pool-C2-250","Pool-U2-250"
"Diuraphis",0.00312643901020334,NA,NA,0.00116328324050249,0.0011341506181867

alexpiper commented 1 year ago

Hi Peter,

Well spotted, i think you might be right with this. Im travelling internationally for the next 3 weeks, but when im back in the office i will look into this a bit deeper and discuss with the other authors. This shouldnt impact the papers overall conclusions but a minor correction might be required to fix this potential error.

peterjc commented 1 year ago

For reference, my R script intended to export the data underlying Figure 3 for re-plotting outside of R:

# Export Figure 3 data
#
# Assume's you are in RStudio and have run the Figure 3 section in
# https://github.com/alexpiper/HemipteraMetabarcodingMS/blob/master/hemiptera_metabarcoding.Rmd
#
# We're just aggregating the data already filtered and compiled, and exporting as TSV.
# To simplify linking the tables, aggregating at genus level to pool the two Acizzia spp. etc.

require(comprehenr)

export_ALL <- df_exp[c("pool_comp", "Genus", "Abundance")]
export_ALL <- export_ALL[order(export_ALL$Genus, export_ALL$pool_comp),]
names(export_ALL)[3] <- "Expected"

export_COI <- df_coi[c("pool_comp", "Genus", "Abundance")]
export_COI <- export_COI[order(export_COI$Genus, export_COI$pool_comp),]
export_COI <- aggregate(. ~  pool_comp + Genus, data = export_COI, sum)
assertthat::are_equal(export_COI$pool_comp, export_ALL$pool_comp)
assertthat::are_equal(export_COI$Genus, export_ALL$Genus)
export_ALL["COI"] <- export_COI$Abundance

export_18S <- df_18s[c("pool_comp", "Genus", "Abundance")]
export_18S <- export_18S[order(export_18S$Genus, export_18S$pool_comp),]
export_18S <- aggregate(. ~  pool_comp + Genus, data = export_18S, sum)
assertthat::are_equal(export_18S$pool_comp, export_ALL$pool_comp)
assertthat::are_equal(export_18S$Genus, export_ALL$Genus)
export_ALL["18S"] <- export_18S$Abundance

export_12S <- df_12s[c("pool_comp", "Genus", "Abundance")]
export_12S <- export_12S[order(export_12S$Genus, export_12S$pool_comp),]
export_12S <- aggregate(. ~  pool_comp + Genus, data = export_12S, sum)
assertthat::are_equal(export_12S$pool_comp, export_ALL$pool_comp)
assertthat::are_equal(export_12S$Genus, export_ALL$Genus)
export_ALL["12S"] <- export_12S$Abundance

export_ALL <- export_ALL[order(export_ALL$pool_comp, export_ALL$Genus),]
# Relabel to match desired captions by converting pool_comp which is e.g. 4.5
# for individuals class 4 (i.e. from 100, 250, 500 and 1000) and pool 5:
names(export_ALL)[1] <- "Caption"
export_ALL[1] <- to_vec(for(px in export_ALL[1]) paste(c("100", "250", "500", "1000")[round(10*(px-floor(px)))], "Pool", floor(px)))

write.table(export_ALL, file="figure3.tsv", sep="\t", quote=FALSE, row.names=FALSE)

peterjc commented 1 year ago

The background to this is I'm hoping to include your dataset as a worked example for my tool, https://github.com/peterjc/thapbi-pict/

peterjc commented 1 year ago

@alexpiper have you had a chance to double check Figure 3's false-negative placement yet? Thanks

alexpiper commented 1 year ago

Very sorry for taking this long to get back to you.

Ive confirmed that the Diuraphis noxia false negative for 12S annotated on 250 Pool 2 should instead be annotated on 500 Pool 2.

While checking this i also noticed a couple of other issues:

Bactericera cockerelli is actually absent from 500 Pool 4 in 18S but this wasnt annotated on the figure.
There was a slight error in the way the averages of the 3 genes were calculated for the "all genes combined" part of Figure 3 which slightly changed the percentages.

We are planning a correction to the article but the corresponding author is on long service leave so it may take some time. Here is what the figure should look like following the changes:

Fig3_Mock_Communities_correction

peterjc commented 1 year ago

Thank you Alex.

My alternative analysis concurs with Bactericera cockerelli being absent from 500 Pool 4 in 18S (we differ on a few points but given the flexibility in threshold settings etc there was nothing else which worried me).

I'll try to finish up my work on https://github.com/peterjc/thapbi-pict/pull/515 using this as a worked example, and intend to cite Batovska et al. 2021 in the planned paper.

alexpiper commented 1 year ago

Great to hear the results are generally in agreement. I agree that with the various thresholds and parameters available at each step of the process its difficult to exactly reproduce the results with a different pipeline, particularly when some taxa are so close to the limit of detection/index switching rate.

Im keen to try your tool in future, and will follow with interest

alexpiper / HemipteraMetabarcodingMS

Misplaced False Negative on Figure 3? #8