violafanfani closed this issue 1 year ago
Is this taking the first 15 characters rather than the first 16? Any reason? (in https://github.com/pmandros/NetSciDataCompanion/blob/main/R/NetSciDataCompanion.R)
extractSampleAndType = function(TCGA_barcodes){
  return(sapply(TCGA_barcodes, substr, 1, 15))
}
These are the first 16 (dummy) characters of a TCGA barcode: "tcga-aa-1020-01a". "tcga-aa-1020" (characters 1 to 12) is the sample/patient, "01" (characters 14 and 15) is the sample type, and "a" (character 16) is the vial. If you want to extract the sample and type, you won't need the vial.
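To make the character positions concrete, here is a minimal sketch that slices the dummy barcode from this comment with base R `substr`. The variable names are only illustrative and are not taken from NetSciDataCompanion.

```r
# Dummy barcode from the comment above (not a real TCGA sample).
barcode <- "tcga-aa-1020-01a"

# Characters 1-12: sample/patient identifier.
patient <- substr(barcode, 1, 12)          # "tcga-aa-1020"

# Characters 14-15: sample type (character 13 is the separating hyphen).
sample_type <- substr(barcode, 14, 15)     # "01"

# Character 16: vial.
vial <- substr(barcode, 16, 16)            # "a"

# Characters 1-15: sample and type together, with the vial dropped.
sample_and_type <- substr(barcode, 1, 15)  # "tcga-aa-1020-01"
```

Stopping at character 15 is what drops the vial letter, which matches the `substr(..., 1, 15)` call quoted in the question.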
👍 Sounds good. In that case I am going to close this issue. Thanks Viola!
I think it should be on the first 15 in that case, as multiple vials are derived from the same sample and hence are duplicates (see the creating barcodes section here).
extractSampleAndType selects the first 15 characters, while ExtractVialOnly extracts up to the 16th character. I think it is correct now (although the names could be confusing). https://github.com/pmandros/NetSciDataCompanion/blob/3f065d742b0c592884897cec2d558299307f7142/R/NetSciDataCompanion.R#L97C1-L108C14 I think the confusion comes from my initial statement that duplicate removal should be done on the first 16 characters. What I meant is that we have to include the sample type in the key and remove duplicates that differ only by vial (the 16th character).
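As a rough illustration of that rule, the sketch below groups barcodes on their first 15 characters and keeps the deepest-sequenced vial in each group. It is not the package's implementation: the data frame, the `depth` column, and the barcodes are all made up for the example.

```r
# Hypothetical input: dummy barcodes with invented sequencing depths.
samples <- data.frame(
  barcode = c("tcga-aa-1020-01a", "tcga-aa-1020-01b", "tcga-aa-1020-11a"),
  depth   = c(100, 250, 80),
  stringsAsFactors = FALSE
)

# Grouping key: patient plus sample type (first 15 characters); the vial is ignored.
samples$key <- substr(samples$barcode, 1, 15)

# Sort by decreasing depth, then keep the first (deepest) row for each key.
ordered <- samples[order(samples$depth, decreasing = TRUE), ]
deduplicated <- ordered[!duplicated(ordered$key), ]
```

Here the two "01" vials collapse into the deeper one, while the "11" sample from the same patient is kept because its sample type differs. Keying on the first 12 characters instead would have discarded it.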
Thanks for the clarification, that makes sense then!
Duplicate removal on depth of sequencing is done on the patient (first 12 characters of the barcode), but it should be done on the first 16, taking into account the sample type.