Harmonize TARGET - part2 DNA

kgaonkar6 commented 3 years ago

This section is required for PM/ADAPT team

[Required] For what study (study_id) is this request?

Alignment open-target-target-alignment-wxs-normal open-target-target-alignment-wxs-tumor

Somatic analysis open-target-target-somatic-mutations-wxs-tumor

[Required] Basic study information

[x] Is this an OT or Kids First study:
[ ] If KF, PI name and fiscal year:
[ ] Study short name:

NOTE: Study short name is required for a D3B study but optional for Kids First X01; fiscal year is required for Kids First X01 but optional for a D3B study. This information will be used for project naming in Cavatica.

[Required] What workflow needs to be done?

For reference: https://www.notion.so/d3b/workflow-type-365a604413534b2b9d93c969557d60a2

[x] alignment
[ ] gatk-hc-gvcf
[ ] family-vcf-genotyping
[ ] cohort-vcf-genotyping
[ ] single-vcf-genotyping
[x] somatic-analysis
[ ] rnaseq-analysis
[ ] only delivery of the source data
[ ] rnaseq toolkit run (D3b toolkit modules such as collapse RSEM files, merge RSEM files, merge fusion calls)
[ ] dnaseq toolkit run (D3b toolkit modules such as consensus CNV analysis, merge consensus maf files)

[Required] Requested sample manifest.

v6 histologies.tsv s3://kf-openaccess-us-east-1-prd-pbta/open-targets/v6/histologies.tsv

For somatic analysis I've matched by Kids_First_Participant_ID with the following code and generated this list : target_t_n_matches.txt

histology %>%
  # filter for TARGET and remove RNA-Seq
  filter(cohort=="TARGET", experimental_strategy != "RNA-Seq") %>% 
  # group by experimental_strategy and Kids_First_Participant_ID to gather tumor-normal
  group_by(experimental_strategy, Kids_First_Participant_ID) %>% 
  # summarise counts 
  summarise(n=n(), 
            # gather sample_type
            sample_type=toString(sample_type), 
            # gather cancer_group
            cancer_group=toString(unique(cancer_group))) %>% 
  # filter for Tumor, Normal pairs
  # some groups have more than 2 samples, e.g. `Tumor, Normal, Normal` and `Tumor, Tumor, Normal` aggregates
  filter(grepl("Tumor, Normal", sample_type), n>=2) %>% 
  # save 
  write_tsv("~/Documents/PedOpenTargets/OpenPBTA-analysis/data/target_t_n_matches.txt")
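Note that `grepl("Tumor, Normal", sample_type)` only matches when a Tumor row is directly followed by a Normal row within a participant's group, so the result depends on row order in histologies.tsv. Below is a minimal sketch of an order-independent check. It uses toy rows with made-up participant IDs, and `n_tumor` / `n_normal` are illustrative names. It is not the code that produced target_t_n_matches.txt.

```r
library(dplyr)

# Toy histology rows (hypothetical IDs, not real TARGET samples)
histology <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_B", "PT_C"),
  experimental_strategy     = "WXS",
  sample_type               = c("Tumor", "Normal", "Normal", "Tumor", "Tumor")
)

pairs <- histology %>%
  group_by(experimental_strategy, Kids_First_Participant_ID) %>%
  # count tumors and normals separately instead of concatenating labels
  summarise(
    n_tumor  = sum(sample_type == "Tumor"),
    n_normal = sum(sample_type == "Normal"),
    .groups  = "drop"
  ) %>%
  # keep participants with at least one tumor and one normal, in any row order
  filter(n_tumor >= 1, n_normal >= 1)

print(pairs)
```

In this toy input, PT_B is listed Normal first, so a string match on "Tumor, Normal" would miss it, while the count-based filter keeps it.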

[ ] The total T-N matches are as follows; please verify. Since AML and NB are already processed, we only need to run Acute Lymphoblastic Leukemia, Osteosarcoma, and Wilms tumor as part 2. If it is not too much work, could you also verify the counts for AML and NB?

Acute Lymphoblastic Leukemia, NA       Acute Myeloid Leukemia, NA                    Neuroblastoma 
                             289                               16                              222 
               Neuroblastoma, NA                 Osteosarcoma, NA                  Wilms tumor, NA 
                             499                               86                               43

[ ] The bed files were discussed previously in Targeted capture #52 and WXS bed #53

[Required] Billing group for this study?

If there's no billing group setup for this study or project yet, please follow the D3b Billing Workflow guidelines and submit your request at D3b Billing Group Creation board

[Optional] List bucket names for source files.

[Optional] What is the priority for the analysis? (Low, Medium, High)

[Optional] How long do you think this work will take?

[Optional] Who will complete this work?

This section is ONLY REQUIRED for Bixops manager/operations

[Required] Provide Cavatica project link.

[Required] Provide details for analysis workflow(s).

jharenza commented 3 years ago

cc @yuankunzhu

bmennis commented 3 years ago

I have copied all WXS .bam TARGET samples from the Cavatica datasets to the tumor WXS alignment Cavatica project. I then used the v6 histologies file to search for samples from Osteosarcoma, ALL, and Wilms tumor. For this I used the column marked Kids_First_Biospecimen_ID and the values there to search file Aliquot IDs. For the disease type of interest I moved the bam files to the appropriate folder for ALL, Osteosarcoma, and Wilms_tumor.

The sample file counts for each folder are:

ALL - 299 bam files
Osteosarcoma - 82 bam files
Wilms Tumor - 46 bam files

These totals differ from counts for WXS samples in both the v6 histologies file and the manifest attached to this ticket:

Osteosarcoma:
target_t_n_matches.txt - 86
v6_histologies.tsv - 90

Wilms Tumor:
target_t_n_matches.txt - 43
v6_histologies.tsv - 50

ALL:
target_t_n_matches.txt - 289
v6_histologies.tsv - 308

Additionally, there is a discrepancy with tumor normal pairs for the samples in each disease type. Here are the tumor and normal counts for each disease type in the Cavatica project:

Note that these file counts reflect total files, so bam + bai index files:
```
Wilms Tumor
Tumor - 90
Normal - 2

ALL
Tumor - 598
Normal - 0

Osteosarcoma
Tumor - 164
Normal - 0
```
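As a quick consistency check, the total file counts can be converted back to bam counts. This is a small sketch that assumes every bam has exactly one bai index next to it; the column names are illustrative.

```r
# Total files (bam + bai) per disease folder, as listed in this comment
file_counts <- data.frame(
  disease      = c("Wilms Tumor", "ALL", "Osteosarcoma"),
  tumor_files  = c(90, 598, 164),
  normal_files = c(2, 0, 0)
)

# Assumption: one bai per bam, so bam count = total files / 2
file_counts$bam_files <- (file_counts$tumor_files + file_counts$normal_files) / 2

print(file_counts[, c("disease", "bam_files")])
```

Under that assumption this gives 46, 299, and 82 bam files, which agrees with the per-folder bam counts reported earlier in this comment.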


Also note that the ALL samples include primary blood derived cancer samples from both bone marrow and peripheral blood, as well as recurrent blood derived cancer - bone marrow samples.

At this point I am trying to figure out whether these are correct to start the analysis, or whether I have missed something and need to re-import or re-sort these files.

Here is a link to the project that has the files under the different folders for disease type:
https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-target-alignment-wxs-tumor/files/#q

jharenza commented 3 years ago

Hi @bmennis - I do not have access to that project, but what do you mean by:

ALL: target_t_n_matches.txt - 289 v6_histologies.tsv - 308

and then:

ALL Tumor - 598 Normal - 0

I did a quick check on the v6 and v7 histologies for ALL and found 289 normals (v6/v7), which correspond to 572 (v6) and 564 (v7) tumor samples. Attached are those 289 normal IDs: normal_acute_lymph_leuk_bs_ids_v7.csv. The v7 histologies file can be found here: s3://kf-openaccess-us-east-1-prd-pbta/open-targets/v7/histologies.tsv

all_pts <- v7 %>%
  filter(pathology_diagnosis == "Acute Lymphoblastic Leukemia") %>%
  pull(Kids_First_Participant_ID) %>%
  unique()

normal_all <- v7 %>%
  filter(sample_type == "Normal" & Kids_First_Participant_ID %in% all_pts) %>%
  select(Kids_First_Biospecimen_ID, Kids_First_Participant_ID, sample_type, aliquot_id, composition) %>%
  write_csv("normal_acute_lymph_leuk_bs_ids_v7.csv")

Does this help? Do you know if there are FASTQ files associated with WXS as well as BAMs, which may make up the remaining missing samples?

jharenza commented 3 years ago

possible UI issue linking GDC to SBG - @yuankunzhu and @zhangb1 to check on this

afarrel commented 3 years ago

Example of osteosarcoma WXS normals: TARGET-40-0A4HLD-10A-01D TARGET-40-0A4HMC-10A-01D TARGET-40-0A4HX8-10A-01D TARGET-40-0A4HXS-10A-01D TARGET-40-0A4HY5-10A-01D TARGET-40-0A4I0Q-10A-01D TARGET-40-0A4I0W-10A-01D TARGET-40-0A4I3S-10A-01D TARGET-40-0A4I4E-10A-01D TARGET-40-0A4I4M-10A-01D TARGET-40-0A4I4O-10A-01D TARGET-40-0A4I5B-10A-01D TARGET-40-0A4I6O-10A-01D TARGET-40-0A4I8U-10A-01D TARGET-40-0A4I9K-10A-01D TARGET-40-0A4I42-10A-01D TARGET-40-0A4I48-10A-01D TARGET-40-0A4I65-10A-01D TARGET-40-PAKFVX-10A-01D TARGET-40-PAKUZU-10A-01D TARGET-40-PAKXLD-10A-01D TARGET-40-PAKZZK-10A-01D TARGET-40-PALECC-10A-01D TARGET-40-PALFYN-10A-01D TARGET-40-PALHRL-10A-01D TARGET-40-PALKDP-10A-01D TARGET-40-PALKGN-10A-01D TARGET-40-PALWWX-10A-01D TARGET-40-PALZGU-10A-01D TARGET-40-PAMEKS-10A-01D TARGET-40-PAMHLF-10A-01D TARGET-40-PAMHYN-10A-01D TARGET-40-PAMJXS-10A-01D TARGET-40-PAMLKS-10A-01D TARGET-40-PAMRHD-10A-01Y TARGET-40-PAMTCM-10A-01Y TARGET-40-PAMYYJ-10A-01D TARGET-40-PANGPE-10A-01D TARGET-40-PANGRW-10A-01Y TARGET-40-PANMIG-10A-01Y TARGET-40-PANPUM-10A-01D TARGET-40-PANSEN-10A-01D TARGET-40-PANVJJ-10A-01D TARGET-40-PANXSC-10A-01D TARGET-40-PANZHX-10A-01Y TARGET-40-PANZZJ-10A-01Y TARGET-40-PAPFLB-10A-01D TARGET-40-PAPIJR-10A-01D TARGET-40-PAPKWD-10A-01Y TARGET-40-PAPNVD-10A-01D TARGET-40-PAPVYW-10A-01Y TARGET-40-PAPWWC-10A-01Y TARGET-40-PAPXGT-10A-01D TARGET-40-PARBGW-10A-01D TARGET-40-PARDAX-10A-01D TARGET-40-PARFTG-10A-01D TARGET-40-PARGTM-10A-01D TARGET-40-PARJXU-10A-01D TARGET-40-PARKAF-10A-01D TARGET-40-PASEBY-10A-01D TARGET-40-PASEFS-10A-01D TARGET-40-PASFCV-10A-01D TARGET-40-PASKZZ-10A-01D TARGET-40-PASNZV-10A-01D TARGET-40-PASRNE-10A-01Y TARGET-40-PASSLM-10A-01D TARGET-40-PASUUH-10A-01Y TARGET-40-PASYUK-10A-01Y TARGET-40-PATAWV-10A-01Y TARGET-40-PATEEM-10A-01Y TARGET-40-PATJVI-10A-01Y TARGET-40-PATKSS-10A-01D TARGET-40-PATMIF-10A-01D TARGET-40-PATMPU-10A-01D TARGET-40-PATMXR-10A-01D TARGET-40-PATPBS-10A-01D TARGET-40-PATUXZ-10A-01D TARGET-40-PATXFN-10A-01D 
TARGET-40-PAUBIT-10A-01Y TARGET-40-PAUTWB-10A-01D TARGET-40-PAUTYB-10A-01D TARGET-40-PAUUML-10A-01D TARGET-40-PAUVUL-10A-01D TARGET-40-PAUXPZ-10A-01D TARGET-40-PAUYTT-10A-01D TARGET-40-PAVALD-10A-01D TARGET-40-PAVCLP-10A-01D TARGET-40-PAVDTY-10A-01D TARGET-40-PAVECB-10A-01D

bmennis commented 3 years ago

Ok I made up some lists of files in the WXS tumor alignment project for the three different disease types. I used the v6 histologies file to search for the tumor and any matching normal samples in the project. It looks like for both Osteosarcoma and ALL that there are some normal samples that either do not have a tumor sample in the histologies file or we do not have the file. For Wilms tumor there are some normal samples missing tumors like the other disease types along with two tumor samples that do not have a normal sample in the histology file.

I have the files attached here: ALL_histology_sampled_pairs.txt Osteosarcoma_histology_sampled_pairs.txt Wilms_tumor_histology_sampled_pairs.txt

bmennis commented 3 years ago

I noticed a bug in the script I used to generate the previous lists: samples that were missing tumor or normal files were sometimes excluded. I wanted these files to list all tumor-normal pairs regardless of whether the files are present in the Cavatica project, so that we know which files we have and which we don't. I believe I have corrected the lists and have them below now:

ALL_histology_sampled_pairs.txt Osteosarcoma_histology_sampled_pairs.txt Wilms_tumor_histology_sampled_pairs.txt

bmennis commented 3 years ago

I'll also upload the script I used to generate these files here: get_tum_norm_pairs_histologies.txt

jharenza commented 3 years ago

thanks @bmennis - are you saying that our totals for these pairs of WXS samples overlapping between histologies.tsv and GDC are as below:

> all %>%
+   filter(!is.na(Tumor_file_name) & !is.na(Norm_file_name)) %>%
+   nrow()
[1] 291
> 
> os %>%
+   filter(!is.na(Tumor_file_name) & !is.na(Norm_file_name)) %>%
+   nrow()
[1] 77
> 
> wilms %>%
+   filter(!is.na(Tumor_file_name) & !is.na(Norm_file_name)) %>%
+   nrow()
[1] 45

From the original table:

> v7 %>%
+   filter(sample_type == "Tumor" & experimental_strategy == "WXS" & cohort == "TARGET") %>%
+   select(cancer_group, experimental_strategy) %>%
+   table()
                              experimental_strategy
cancer_group                   WXS
  Acute Lymphoblastic Leukemia 308
  Acute Myeloid Leukemia        31
  Neuroblastoma                222
  Osteosarcoma                  90
  Wilms tumor                   50

That seems pretty good to me! cc @chinwallaa

bmennis commented 3 years ago

I just wanted to update that the alignment of the tumor WXS samples is complete. I am patching metadata to the outputs, which is taking some time due to the large number of samples and outputs.

I am working on getting the normal samples to run now and will update with developments.

bmennis commented 3 years ago

I am finishing up the WXS normal sample alignment harmonizations. All samples completed successfully except one that I am not able to harmonize. The aliquot ID for the normal sample is TARGET-50-PAJNNR-10A-01D and the case ID is TARGET-50-PAJNNR. The issue seems to be that I do not have access to the file: when I try to download it, Cavatica returns an error, and when I try to run the alignment workflow on it, the task fails saying access is denied. This does not make sense, as I have access to the other samples, so I am not entirely sure how to fix this.

I will patch metadata to the other alignment outputs in the meantime and see if there is something I can do about this.

zhangb1 commented 3 years ago

@jharenza could you look at that sample? ^ The file is

47f76f61-5d88-504d-95de-e779355042ed_wxs_gdc_realn.bam

I tried to re-import it and it still failed. Not sure what the issue is.

chinwallaa commented 3 years ago

@afarrel does anyone have credentials to download this from GDC directly https://portal.gdc.cancer.gov/files/faaf9e26-05c5-486c-aec4-11e1fa45f8df instead of via the Cavatica link ?

bmennis commented 3 years ago

It looks like I can download it from the GDC. It will take some time, but afterwards I can upload it to Cavatica.

bmennis commented 3 years ago

The last normal WXS sample has finished aligning; I used the copy downloaded from the GDC. There appears to be an issue with Cavatica in which some samples cannot be accessed even though Cavatica and CGC show that TARGET access is granted.

The aligned reads for the normal and tumor samples, along with the raw gvcf files for the normals, have all been copied to the delivery project. Here are links to both output folders:
https://cavatica.sbgenomics.com/u/d3b-bixu/open-target/files/#q?path=harmonized-data%2Faligned-reads&page=1
https://cavatica.sbgenomics.com/u/d3b-bixu/open-target/files/#q?path=harmonized-data%2Fraw-gvcf

I am going to work on the somatic analysis for the paired samples now and will update with developments. I will see if the access issue can be fixed so that I may run analysis.

zhangb1 commented 3 years ago

After checking with NGSCheckMate, here are the total pairs for each disease type. We will run the somatic workflow for those pairs.

  Acute Lymphoblastic Leukemia 291
  Osteosarcoma                  77
  Wilms tumor                   45

jharenza commented 3 years ago

@bmennis can you provide a final list of samples which you have run with this iteration for @ewafula

zhangb1 commented 3 years ago

@jharenza Brian has a laptop issue, so I have posted below the tumor and normal aliquot IDs that were processed for this part 2.

Tumor: 430 Normal: 400

run_normal.txt run_tumor.txt

The pairs information is below; somatic analysis is still in process. Total of 413 pairs: tumor_normal_pairs.txt

@ewafula ^

ewafula commented 3 years ago

Thanks, @zhangb1 !

On Thu, Aug 12, 2021 at 9:31 AM Bo Zhang @.***> wrote:

@jharenza https://github.com/jharenza Brian has the laptop issue, I just posted the tumor and normal aliquots id below which processed for this part2

Tumor: 430 Normal:400

run_normal.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6975790/run_normal.txt run_tumor.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6975793/run_tumor.txt

and the pairs information is below: total of 413 pairs tumor_normal_pairs.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6975801/tumor_normal_pairs.txt

@ewafula https://github.com/ewafula ^


ewafula commented 3 years ago

@zhangb1, just for the record: only 3 cancer types are represented in the TARGET WXS data. You mention `Part2`; is there another list for `Part1`?

| TARGET Cancer Type | Normal | Tumor |
|---|---|---|
| Acute lymphoblastic leukemia | 281 | 303 |
| Osteosarcoma | 74 | 82 |
| Wilms tumor | 45 | 45 |
| Acute myeloid leukemia | 0 | 0 |
| Neuroblastoma | 0 | 0 |
| Rhabdoid tumor | 0 | 0 |
| Clear cell sarcoma of the kidney | 0 | 0 |

cc @jharenza

jharenza commented 3 years ago

@zhangb1, just for the record. Only 3 cancer types are represented in the TARGET WXS data. You mention Part2, is there another list for Part1?

Part 1 was released with v7, AML + NBL. We will do a part 3, which will be remaining ALL, so you should have what you need now.

ewafula commented 3 years ago

Ok, thanks!

zhangb1 commented 3 years ago

The somatic analysis is done, but we only have results for 412 pairs. One pair failed: tumor: TARGET-50-PAKGZX-01A-01D, normal: TARGET-50-PAKGZX-10A-01D

https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-target-somatic-mutations-wxs-tumor/tasks/fa9dd283-47dc-4d07-bc7d-d2a15db69284/

Judging by the error and the alignment metrics, the tumor sample TARGET-50-PAKGZX-01A-01D does not seem to be a paired-end bam file.

We may need to skip this pair for the v8 release. @jharenza @afarrel
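Skipping the failed pair in a pairs list could look like the sketch below. The column names `tumor_aliquot` and `normal_aliquot` and the `EXAMPLE` rows are hypothetical; only the TARGET-50-PAKGZX aliquot IDs come from this thread, and the real layout of tumor_normal_pairs.txt may differ.

```r
library(dplyr)

# Toy pairs table; column names and the EXAMPLE row are hypothetical
pairs <- data.frame(
  tumor_aliquot  = c("TUMOR-EXAMPLE-1", "TARGET-50-PAKGZX-01A-01D"),
  normal_aliquot = c("NORMAL-EXAMPLE-1", "TARGET-50-PAKGZX-10A-01D")
)

# The pair whose somatic task failed
failed <- data.frame(
  tumor_aliquot  = "TARGET-50-PAKGZX-01A-01D",
  normal_aliquot = "TARGET-50-PAKGZX-10A-01D"
)

# anti_join keeps only the rows of `pairs` that have no match in `failed`
pairs_v8 <- anti_join(pairs, failed, by = c("tumor_aliquot", "normal_aliquot"))

print(pairs_v8)
```

Matching on both the tumor and normal columns removes only that specific pair, so a normal shared with another tumor would stay in the list.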

jharenza commented 3 years ago

Thanks @zhangb1 also cc @ewafula to exclude from the v8 histology file

ewafula commented 3 years ago

@jharenza, I will exclude this tumor-normal pair and update the PR. Question: for the v8 update to filter out WXS samples that are not in @zhangb1's final list of read files from GDC, was I supposed to keep all samples on his list, or only the samples on his list with a tumor-normal pair?

kgaonkar6 commented 3 years ago

Hi @ewafula , is there a ticket for all the histology file update for v8? @zhangb1 noticed that GMKF WGS were missing germline_sex_estimate which he has updated now and we would like that to be updated in v8 histologies.tsv file release as well.

ewafula commented 3 years ago

Hi @ewafula , is there a ticket for all the histology file update for v8? @zhangb1 noticed that GMKF WGS were missing germline_sex_estimate which he has updated now and we would like that to be updated in v8 histologies.tsv file release as well.

@kgaonkar6, there is no single ticket for v8 updates. They are spread over multiple tickets along with other issues. Please open one and include a mapping file of bs_ids to germline_sex_estimate to update v8/histologies.tsv. I have made all required updates so far and will update again once you open a ticket.

jharenza commented 3 years ago

@jharenza, I will exclude this tumor-normal pair and update the PR. Question: for the v8 update to filter out WXS samples that are not in @zhangb1's final list of read files from GDC, was I supposed to keep all samples on his list, or only the samples on his list with a tumor-normal pair?

We only have tumor normal methods so far- it may be the case that there is a relapse tumor or primary tumor on his list which we don't have access to but we still have the normal and other tumor, so I think whatever is on the list would be excluded? Cc @zhangb1 to confirm

zhangb1 commented 3 years ago

@jharenza @ewafula

We may want to keep the other unpaired samples too, since we ran the alignment analysis on those as well, just for the record. We may get the paired tumor or normal sample in the future.

jharenza commented 3 years ago

@zhangb1 I guess the list we want is the list of T/N pairs for which we have somatic calls, including cases where one normal pairs with two tumors (N1/T1 and N1/T2). We will keep all info in a separate file, but keep only the samples with somatic data in the v8 histologies file, because it will affect the independent sample selection module. Could you provide that list?

zhangb1 commented 3 years ago

@jharenza @ewafula

After removing the pair that has the error, the total is 412 pairs. Some tumors share the same normal.

412_pairs_task_info.txt

jharenza commented 3 years ago

@jharenza @ewafula

After removing the pair that has the error, the total is 412 pairs. Some tumors share the same normal.

412_pairs_task_info.txt

Perfect, thank you

ewafula commented 3 years ago

Thanks @zhangb1, I had already removed the problematic pair and updated the v8 histologies file.

On Mon, Aug 16, 2021 at 10:19 AM Jo Lynne Rokita @.***> wrote:

@jharenza https://github.com/jharenza @ewafula https://github.com/ewafula

After removing the pair that has the error, the total is 412 pairs. Some tumors share the same normal.

412_pairs_task_info.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6992958/412_pairs_task_info.txt

Perfect, thank you


runjin326 commented 3 years ago

Closed with PR89

d3b-center / ticket-tracker-OPC