QuackenbushLab / NetworkDataCompanion

An R library of utilities for performing analyses on TCGA and GTEx data using the Network Zoo
GNU General Public License v3.0

Duplicate removal should be done on sample and not the patient #17

Closed violafanfani closed 1 year ago

violafanfani commented 1 year ago

Duplicate removal based on depth of sequencing is currently done per patient (first 12 characters of the barcode), but it should be done on the first 16, taking the sample type into account.

katehoffshutta commented 1 year ago

Is this taking the first 15 and not 16? Any reason? (in https://github.com/pmandros/NetSciDataCompanion/blob/main/R/NetSciDataCompanion.R)

extractSampleAndType = function(TCGA_barcodes){
  # Keep the first 15 characters of each barcode: patient plus sample type, without the vial
  return(sapply(TCGA_barcodes, substr, 1, 15))
}

violafanfani commented 1 year ago

Here are the first 16 characters of a (dummy) TCGA barcode: "tcga-aa-1020-01a". "tcga-aa-1020" is the sample/patient, "01" is the sample type, and "a" is the vial. If you want to extract the sample and type, you don't need the vial, so the first 15 characters are enough.
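To make the character positions concrete, here is a minimal sketch that splits the dummy barcode above using base R `substr`. The variable names are only illustrative and are not taken from the package.

```r
# Dummy barcode from the comment above (not a real TCGA sample).
barcode <- "tcga-aa-1020-01a"

# Positions 1-12: patient; 14-15: sample type; 16: vial.
patient         <- substr(barcode, 1, 12)
sample_type     <- substr(barcode, 14, 15)
vial            <- substr(barcode, 16, 16)

# Patient plus sample type, without the vial: the first 15 characters.
sample_and_type <- substr(barcode, 1, 15)

print(c(patient, sample_type, vial, sample_and_type))
```

With this layout, two barcodes that differ only in position 16 share the same 15-character prefix.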

katehoffshutta commented 1 year ago

👍 Sounds good. In that case I am going to close this issue. Thanks Viola!

FischerJoBio commented 1 year ago

I think it should be on the first 15 in that case, as multiple vials are derived from the same sample and hence are duplicates (see the creating barcodes section here).

violafanfani commented 1 year ago

extractSampleAndType selects the first 15 characters, while ExtractVialOnly extracts up to the 16th character. I think it is correct now (although the names could be confusing): https://github.com/pmandros/NetSciDataCompanion/blob/3f065d742b0c592884897cec2d558299307f7142/R/NetSciDataCompanion.R#L97C1-L108C14

I think the confusion comes from my initial statement that duplicate removal should be done on the first 16 characters. What I meant is that we have to include the sample type in the key and only remove duplicates that differ by vial (the 16th character).
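For illustration, duplicate removal keyed on the first 15 characters could look roughly like the sketch below. This is not the package's implementation: the `samples` data frame, its `depth` column, and the rule of keeping the deepest-sequenced entry per key are all assumptions made for this example.

```r
# Hypothetical input: one row per barcode, with a made-up sequencing depth.
samples <- data.frame(
  barcode = c("tcga-aa-1020-01a", "tcga-aa-1020-01b",
              "tcga-aa-1020-11a", "tcga-bb-2040-01a"),
  depth = c(100, 250, 80, 120),
  stringsAsFactors = FALSE
)

# Key on patient + sample type (first 15 characters), ignoring the vial.
samples$key <- substr(samples$barcode, 1, 15)

# Sort by depth, highest first, then keep the first row seen for each key
# (assumption: the entry with the highest depth is the one to keep).
ordered <- samples[order(-samples$depth), ]
deduplicated <- ordered[!duplicated(ordered$key), ]

print(deduplicated$barcode)
```

Here the two vials of sample type "01" from patient "tcga-aa-1020" collapse to one entry, while the "11" sample from the same patient is kept because its sample type differs. Keying on the first 12 characters instead would have dropped it.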

FischerJoBio commented 1 year ago

Thanks for the clarification, that makes sense then!