broadinstitute / gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons
Other
31 stars 4 forks source link

possible bug in diced_file_comparator #42

Closed cbirger closed 7 years ago

cbirger commented 7 years ago

I was looking at diced_file_comparator and description of the heuristic for selecting the "best" file described in:

https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-replicateFilteringQ Whatdoyoudowhenmultiplealiquotbarcodesexistforagivensampleportionanalytecombination

Since the H analyte is preferred over R and T analytes for RNA, shouldn't the comparator return -1 rather than 1 if anlayte1 = 'H'?

see below...

def diced_file_comparator(a, b): '''Comparator function for barcodes, using the rules described in the GDAC FAQ entry for replicate samples: https://confluence.broadinstitute.org/display/GDAC/FAQ '''

# Convert files to barcodes by splitting
a = a.split('.')[0]
b = b.split('.')[0]

# Get the analytes and plates
# TCGA-BL-A0C8-01A-11<Analyte>-<plate>-01
analyte1 = a[19]
analyte2 = b[19]
plate1   = a[21:25]
plate2   = b[21:25]

# Equals case
if a == b:
    return 0
elif analyte1 == analyte2:
    # Prefer the aliquot with the highest lexicographical sort value
    return -1 if a >= b else 1
elif analyte1 == "H":
    # Prefer H over R and T
    return 1
elif analyte1 == "R":
    # Prefer R over T
    return -1 if analyte2 == "T" else 1
elif analyte1 == "T":
    # Prefer H and R over T
    return 1
elif analyte1 == "D":
    # Prefer D over G,W,X, unless plate number is higher
    return -1 if plate2 <= plate1 else 1
elif analyte2 == "D":
    return 1 if plate1 <= plate2 else -1
else:
    # Default back to highest lexicographical sort value
    return -1 if a >= b else 1
noblem commented 7 years ago

Yes, the code does appear to be in error and I will test more.

But, by inspection of the legacy TCGA data I am of the strong opinion that this error would not have had any effect on harmonized data from GDC. That is, on the Broad network the following command

egrep "H-....-..$" ~mnoble/runs/sampleReports/latest/filteredSamples.2016_07_15__00_00_14.txt | cut -f1,4-6 | grep Analy

shows that in the legacy data there were only a handful of replicates (6 total) where H analytes were selected, and in each case they were selected over another H analyte by virtue of having a higher plate number (not for precedence over R or T analytes).

Similarly, this command

egrep "R-....-..$" ~mnoble/runs/sampleReports/latest/filteredSamples.2016_07_15__00_00_14.txt | cut -f1,4-6 | grep Analy

likewise shows that of the 112 replicates where R analytes were selected, zero of them were selected INSTEAD of H analytes. Some were selected over T analytes but most were selected for higher plate numbers.

I've confirmed a similar pattern in the harmonized GDC data diced by GDCtools, so thankfully this bug would not have any effect on folks who have already downloaded and are using GDCtools.

noblem commented 7 years ago

This is fixed in the repo.