cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Fix the skcm_tcga study #3069

Closed priti88 closed 6 years ago

priti88 commented 6 years ago

An error in the importer is creating duplicate samples with different IDs. Bug reported by a user: https://groups.google.com/forum/#!topic/cbioportal/dmMG9QrNDCw

pieterlukasse commented 6 years ago

Hi @priti88, can you identify the bug in the importer? As far as I know, it is quite common for a patient to have both an 01 and an 06 sample.

priti88 commented 6 years ago

@pieterlukasse I am not sure. It might not be an importer bug. @angelicaochoa is looking into this further.

ao508 commented 6 years ago

Hi @pieterlukasse. @sheridancbio and I have been investigating why we see sample TCGA-HR-A2OG-01 when the GDC portal reports only TCGA-HR-A2OG-06 for patient TCGA-HR-A2OG.

All genomic data for patient TCGA-HR-A2OG is linked to sample id TCGA-HR-A2OG-06 except for CNA and mutations. I did a quick survey of the data we received from Firehose and found that the mutations are in fact linked to sample TCGA-HR-A2OG-01, which is why we see two samples for patient TCGA-HR-A2OG.

@sheridancbio may have more to update soon.
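The survey described above (checking which sample ID each data type is linked to) can be sketched roughly as follows. This is a minimal illustration, not the importer's actual logic: the input dictionary, the data type names, and the placeholder barcode `TCGA-XX-0001-06` are assumptions made up for the example; only the TCGA-HR-A2OG pattern comes from this thread.

```python
from collections import defaultdict

# Illustrative input (assumption): sample IDs found in each data file,
# keyed by a made-up data type name. TCGA-XX-0001 is a placeholder patient.
samples_by_data_type = {
    "mutations": ["TCGA-HR-A2OG-01", "TCGA-XX-0001-06"],
    "rna_seq": ["TCGA-HR-A2OG-06", "TCGA-XX-0001-06"],
    "rppa": ["TCGA-HR-A2OG-06", "TCGA-XX-0001-06"],
}


def patient_id(sample_id):
    """Return the patient part of a barcode, e.g. TCGA-HR-A2OG-01 -> TCGA-HR-A2OG."""
    return "-".join(sample_id.split("-")[:3])


def find_multi_sample_patients(samples_by_data_type):
    """Map each patient with more than one sample ID to {sample ID: data types}."""
    seen = defaultdict(lambda: defaultdict(set))
    for data_type, sample_ids in samples_by_data_type.items():
        for sample_id in sample_ids:
            seen[patient_id(sample_id)][sample_id].add(data_type)
    return {
        patient: {sid: sorted(types) for sid, types in samples.items()}
        for patient, samples in seen.items()
        if len(samples) > 1
    }


if __name__ == "__main__":
    for patient, samples in find_multi_sample_patients(samples_by_data_type).items():
        print(patient, samples)
```

A patient reported by this check is not necessarily an error: some patients genuinely have both a primary and a metastatic sample, so each hit still needs to be compared against the GDC portal.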

aindacochea commented 6 years ago

Hello angelicaochoa,

Thank you for explaining what is happening.

1. If I understood correctly, for this patient the CNA and mutation data are from TCGA-HR-A2OG-01 (which means a primary melanoma, because in other samples the -01 code means primary tumor), while the other data, such as RNA-seq and protein expression, are from TCGA-HR-A2OG-06 (which means a metastatic melanoma sample, because in other samples the -06 code means metastasis).

So that means this patient has two samples.

But if you look at the pathology report, it seems to be the same sample (the report for the primary tumor describes a metastasis).

2. I saw similar cases in the TCGA melanoma dataset using cBioPortal.

The patients with two samples are:

TCGA-D3-A1Q6 TCGA-D3-A1QA TCGA-D9-A1X3 TCGA-ER-A19T TCGA-ER-A2NF TCGA-HR-A2OG TCGA-HR-A2OH TCGA-XV-AB01

I would like to know how you did it, so that I can handle this problem in the future with other datasets.

3. In the downloaded clinical data, this patient's sample TCGA-HR-A2OG-01 is not labeled as a primary sample. The same is true for the other patients with two samples.

Is it possible to correct this?

Thank you very much for your time

DarioS commented 6 years ago

The identifiers differ because a sample mix-up was discovered last year. The cancer sample was originally thought to be a primary tumor but is actually a metastasis, so the 01 code was changed to 06. The problem arises because cBioPortal depends on the old datasets from before the TCGA Data Portal closed down and reappeared as the Genomic Data Commons. Note that GDC also has the updated barcodes for these patients, with the discrepancies noted.

TCGA also discourages identifying samples by barcode and encourages the use of UUIDs, which do not change if a sample mix-up is identified after the barcode has been publicly issued and become widely used.
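To make the 01 versus 06 distinction concrete, here is a small sketch of how a sample barcode can be split into its patient part and its sample type code. The function name and the lookup table are my own; the table covers only the three codes that come up in this thread (the full list is in the TCGA code tables).

```python
# Subset of TCGA sample type codes; only the ones mentioned in this thread.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "06": "Metastatic",
    "07": "Additional Metastatic",
}


def parse_sample_barcode(barcode):
    """Split a sample barcode into (patient ID, sample type code, description).

    Accepts barcodes with or without a vial letter, e.g. TCGA-HR-A2OG-06
    or TCGA-HR-A2OG-06A.
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA" or len(parts[3]) < 2:
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")
    patient = "-".join(parts[:3])
    code = parts[3][:2]  # drop the vial letter if present
    return patient, code, SAMPLE_TYPES.get(code, "Unknown")


if __name__ == "__main__":
    print(parse_sample_barcode("TCGA-HR-A2OG-01"))
    print(parse_sample_barcode("TCGA-HR-A2OG-06"))
```

Because the type code is embedded in the barcode, reclassifying a sample changes its barcode, which is exactly why the same physical sample can show up under two IDs in older data.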

aindacochea commented 6 years ago

Thank you DarioS for your advice. I have another question. As you can see above in angelicaochoa's comment:

"...All genomic data for patient TCGA-HR-A2OG is linked to sample id TCGA-HR-A2OG-06 except for CNA and mutations. I did a quick survey of the data we received from Firehose and found that the mutations are in fact linked to sample TCGA-HR-A2OG-01, which is why we see two samples for patient TCGA-HR-A2OG."

My question is:

For these cases (in the TCGA melanoma dataset), could you tell me, using cBioPortal, whether the data (CNA, mutation, RNA-seq and protein expression) come from the primary or the metastatic sample, the way she did before? (I'm a clinician, and it is very easy for me to extract the information this way.)

The patients with two samples are:

TCGA-D3-A1Q6 TCGA-D3-A1QA TCGA-D9-A1X3 TCGA-ER-A19T TCGA-ER-A2NF TCGA-HR-A2OG TCGA-HR-A2OH TCGA-XV-AB01

ao508 commented 6 years ago

Hi @aindacochea,

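Through the GDC portal site (https://portal.gdc.cancer.gov/projects/TCGA-SKCM) I was able to map the following:

TCGA-D3-A1Q6

  • TCGA-D3-A1Q6-06 (Met)

TCGA-D3-A1QA

  • TCGA-D3-A1QA-07 (Met)
  • TCGA-D3-A1QA-06 (Met)

TCGA-D9-A1X3

  • TCGA-D9-A1X3-06 (Met)

TCGA-ER-A19T

  • TCGA-ER-A19T-01 (Primary)
  • TCGA-ER-A19T-06 (Met)

TCGA-ER-A2NF

  • TCGA-ER-A2NF-01 (Primary)
  • TCGA-ER-A2NF-06 (Met)

TCGA-HR-A2OG

  • TCGA-HR-A2OG-06 (Met)

TCGA-HR-A2OH

  • TCGA-HR-A2OH-06 (Met)

TCGA-XV-AB01

  • TCGA-XV-AB01-06 (Met)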

aindacochea commented 6 years ago

Thank you very much, Angelica.

Recalling your previous answer: that patient has 2 samples, and each sample had a different molecular analysis (one sample had the mutation profile, while the other had the other genomic tests, RNA expression and protein expression).

"All genomic data for patient TCGA-HR-A2OG is linked to sample id TCGA-HR-A2OG-06 except for CNA and mutations. I did a quick survey of the data we received from Firehose and found that the mutations are in fact linked to sample TCGA-HR-A2OG-01, which is why we see two samples for patient TCGA-HR-A2OG."

Could you tell me, for the patients with 2 samples (in the TCGA melanoma dataset), whether the data (CNA, mutation, RNA-seq and protein expression) come from the primary or the metastatic sample, the way you did before for patient TCGA-HR-A2OG?

Thank you

Alberto

2017-11-07 17:04 GMT+01:00 angelicaochoa notifications@github.com:

Hi @aindacochea,

Through the GDC portal site I was able to map the following:

TCGA-D3-A1Q6

  • TCGA-D3-A1Q6-06 (Met)

TCGA-D3-A1QA

  • TCGA-D3-A1QA-07 (Met)
  • TCGA-D3-A1QA-06 (Met)

TCGA-D9-A1X3

  • TCGA-D9-A1X3-06 (Met)

TCGA-ER-A19T

  • TCGA-ER-A19T-01 (Primary)
  • TCGA-ER-A19T-06 (Met)

TCGA-ER-A2NF

  • TCGA-ER-A2NF-01 (Primary)
  • TCGA-ER-A2NF-06 (Met)

TCGA-HR-A2OG

  • TCGA-HR-A2OG-06 (Met)

TCGA-HR-A2OH

  • TCGA-HR-A2OH-06 (Met)

TCGA-XV-AB01

  • TCGA-XV-AB01-06 (Met)


ao508 commented 6 years ago

@aindacochea As @DarioS mentioned, there was a sample mix-up. In the case of TCGA-HR-A2OG, for example, the sample identifier was changed from -01 to -06, so the data for both -01 and -06 belong to the same sample (-06).

I believe this is also the case for the patients I listed above that have only a single sample identifier. I cross-referenced this information with what is available in the GDC portal's TCGA-SKCM project.

For the following cases I would resolve all sample identifiers to the sample identifier provided in this list:

TCGA-D3-A1Q6

  • TCGA-D3-A1Q6-06

TCGA-D9-A1X3

  • TCGA-D9-A1X3-06

TCGA-HR-A2OG

  • TCGA-HR-A2OG-06

TCGA-HR-A2OH

  • TCGA-HR-A2OH-06

TCGA-XV-AB01

  • TCGA-XV-AB01-06

The other patients listed in my previous comment do in fact have more than one sample. The ones listed here, however, can each be resolved to a single sample identifier. For example, for patient TCGA-D9-A1X3, resolve all sample identifiers to TCGA-D9-A1X3-06.
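If it helps, the resolution described above could be applied to downloaded data with something like the sketch below. The mapping and function names are hypothetical, not part of cBioPortal. The -06 entry for TCGA-D3-A1Q6 is an assumption based on the other single-sample cases, so please verify it against the GDC portal before relying on it.

```python
# Patients whose sample IDs should all resolve to one identifier.
# Hypothetical helper table built from the list in this comment.
RESOLVED_SAMPLE = {
    "TCGA-D3-A1Q6": "TCGA-D3-A1Q6-06",  # assumed -06; verify against GDC
    "TCGA-D9-A1X3": "TCGA-D9-A1X3-06",
    "TCGA-HR-A2OG": "TCGA-HR-A2OG-06",
    "TCGA-HR-A2OH": "TCGA-HR-A2OH-06",
    "TCGA-XV-AB01": "TCGA-XV-AB01-06",
}


def resolve_sample_id(sample_id):
    """Return the resolved sample ID, or the input unchanged if no rule applies."""
    patient = "-".join(sample_id.split("-")[:3])
    return RESOLVED_SAMPLE.get(patient, sample_id)


if __name__ == "__main__":
    for sid in ["TCGA-HR-A2OG-01", "TCGA-HR-A2OG-06", "TCGA-ER-A19T-01"]:
        print(sid, "->", resolve_sample_id(sid))
```

Patients that really have two samples, such as TCGA-ER-A19T, are deliberately left out of the table, so their -01 and -06 identifiers pass through unchanged.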