Closed cbirger closed 7 years ago
Yes, the code does appear to be in error and I will test more.
However, from inspecting the legacy TCGA data, I strongly believe this error would not have had any effect on harmonized data from GDC. On the Broad network, the following command
egrep "H-....-..$" ~mnoble/runs/sampleReports/latest/filteredSamples.2016_07_15__00_00_14.txt | cut -f1,4-6 | grep Analy
shows that in the legacy data there were only a handful of replicates (6 total) where H analytes were selected, and in each case they were selected over another H analyte by virtue of having a higher plate number (not for precedence over R or T analytes).
Similarly, this command
egrep "R-....-..$" ~mnoble/runs/sampleReports/latest/filteredSamples.2016_07_15__00_00_14.txt | cut -f1,4-6 | grep Analy
shows that of the 112 replicates where R analytes were selected, none were selected instead of H analytes. Some were selected over T analytes, but most were selected for their higher plate numbers.
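For readers without access to that report file, here is a minimal Python sketch of what the two egrep patterns match. It assumes the usual TCGA aliquot barcode tail of analyte letter, 4-character plate, and 2-character center; the barcodes and the function name `tail_analyte` are made up for illustration and do not come from the report.

```python
import re

# Same tail patterns as the egrep commands above:
# analyte letter, '-', 4-character plate, '-', 2-character center, end of line.
H_TAIL = re.compile(r"H-....-..$")
R_TAIL = re.compile(r"R-....-..$")

def tail_analyte(barcode):
    """Return 'H' or 'R' if the barcode tail matches that pattern, else None."""
    if H_TAIL.search(barcode):
        return "H"
    if R_TAIL.search(barcode):
        return "R"
    return None

# Hypothetical barcodes, for illustration only.
examples = [
    "TCGA-AA-0001-01A-11H-A100-07",
    "TCGA-AA-0001-01A-11R-A200-07",
    "TCGA-AA-0001-01A-11D-A300-05",
]
for barcode in examples:
    print(barcode, tail_analyte(barcode))
```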
I've confirmed a similar pattern in the harmonized GDC data diced by GDCtools, so thankfully this bug would not have any effect on folks who have already downloaded and are using GDCtools.
This is fixed in the repo.
I was looking at diced_file_comparator and the description of the heuristic for selecting the "best" file in:
https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-replicateFilteringQ Whatdoyoudowhenmultiplealiquotbarcodesexistforagivensampleportionanalytecombination
Since the H analyte is preferred over R and T analytes for RNA, shouldn't the comparator return -1 rather than 1 if analyte1 = 'H'?
see below...
def diced_file_comparator(a, b):
    '''Comparator function for barcodes, using the rules described in
    the GDAC FAQ entry for replicate samples:
    https://confluence.broadinstitute.org/display/GDAC/FAQ
    '''
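To make the expected behavior concrete, here is a minimal sketch of a comparator that follows the rule as described in this thread: for RNA, H is preferred over R, R over T, and among aliquots with the same analyte the higher plate wins. This is not the GDCtools implementation. The names `ANALYTE_RANK`, `parse_aliquot`, and `sketch_comparator` are hypothetical, the barcodes are invented, plates are assumed to compare as strings, and other analyte types are simply ranked last.

```python
from functools import cmp_to_key

# Assumed RNA analyte preference, most preferred first (H over R over T),
# as stated in this thread. Other analytes are ranked last as a simplification.
ANALYTE_RANK = {"H": 0, "R": 1, "T": 2}

def parse_aliquot(barcode):
    """Return (analyte, plate) from a TCGA-style aliquot barcode."""
    fields = barcode.split("-")
    # fields[4] is portion + analyte (e.g. '11R'); fields[5] is the plate.
    return fields[4][-1], fields[5]

def sketch_comparator(a, b):
    """Return -1 if barcode a is preferred, 1 if b is preferred, 0 if tied."""
    analyte_a, plate_a = parse_aliquot(a)
    analyte_b, plate_b = parse_aliquot(b)
    rank_a = ANALYTE_RANK.get(analyte_a, len(ANALYTE_RANK))
    rank_b = ANALYTE_RANK.get(analyte_b, len(ANALYTE_RANK))
    if rank_a != rank_b:
        # The preferred analyte sorts first, so it must yield -1.
        return -1 if rank_a < rank_b else 1
    if plate_a != plate_b:
        # Same analyte: the higher plate is preferred (string comparison assumed).
        return -1 if plate_a > plate_b else 1
    return 0

# Hypothetical replicates of one sample, for illustration only.
replicates = [
    "TCGA-AA-0001-01A-11T-A300-07",
    "TCGA-AA-0001-01A-11R-A200-07",
    "TCGA-AA-0001-01A-11H-A100-07",
]
best = sorted(replicates, key=cmp_to_key(sketch_comparator))[0]
print(best)
```

With this convention the preferred barcode sorts to the front of the list, which is why an H analyte in the first argument should produce -1 rather than 1.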