Closed kgaonkar6 closed 3 years ago
cc @yuankunzhu
I have copied all WXS `.bam` TARGET samples from the Cavatica datasets to the tumor WXS alignment Cavatica project. I then used the v6 histologies file to search for samples from Osteosarcoma, ALL, and Wilms tumor. For this I used the `Kids_First_Biospecimen_ID` column and matched its values against the file Aliquot IDs. For each disease type of interest I moved the BAM files to the appropriate folder: `ALL`, `Osteosarcoma`, or `Wilms_tumor`.
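The matching step described above could be sketched roughly as follows. This is only an illustration, not the procedure actually used in Cavatica: the `(file_name, aliquot_id)` tuple shape, the `assign_folders` helper, and the example rows are all hypothetical. The column names and `cancer_group` spellings are taken from the histologies output quoted later in this thread. Because the lookup is keyed on `cancer_group`, it would only pick up rows that carry a cancer group, which is an assumption about how tumor rows are annotated.

```python
import csv
import io
from collections import defaultdict

# Folder names come from the comment above; the cancer_group spellings are
# the ones printed from the histologies file later in this thread.
FOLDER_BY_CANCER_GROUP = {
    "Acute Lymphoblastic Leukemia": "ALL",
    "Osteosarcoma": "Osteosarcoma",
    "Wilms tumor": "Wilms_tumor",
}


def assign_folders(histology_rows, bam_files):
    """Map BAM file names to disease folders via their aliquot IDs.

    histology_rows: dicts with Kids_First_Biospecimen_ID,
        experimental_strategy and cancer_group keys.
    bam_files: (file_name, aliquot_id) tuples (hypothetical shape).
    Returns (files grouped by folder, files with no histology match).
    """
    folder_by_id = {}
    for row in histology_rows:
        folder = FOLDER_BY_CANCER_GROUP.get(row["cancer_group"])
        if folder and row["experimental_strategy"] == "WXS":
            folder_by_id[row["Kids_First_Biospecimen_ID"]] = folder

    assigned = defaultdict(list)
    unmatched = []
    for file_name, aliquot_id in bam_files:
        folder = folder_by_id.get(aliquot_id)
        if folder is None:
            unmatched.append(file_name)
        else:
            assigned[folder].append(file_name)
    return dict(assigned), unmatched


# Made-up rows and file names, for illustration only.
EXAMPLE_TSV = (
    "Kids_First_Biospecimen_ID\texperimental_strategy\tcancer_group\n"
    "TARGET-00-AAAAAA-01A-01D\tWXS\tOsteosarcoma\n"
    "TARGET-00-BBBBBB-09A-01D\tWXS\tAcute Lymphoblastic Leukemia\n"
    "TARGET-00-CCCCCC-01A-01R\tRNA-Seq\tWilms tumor\n"
)
rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
bams = [
    ("a.bam", "TARGET-00-AAAAAA-01A-01D"),
    ("b.bam", "TARGET-00-BBBBBB-09A-01D"),
    ("c.bam", "TARGET-00-ZZZZZZ-01A-01D"),  # not in the histologies rows
]
assigned, unmatched = assign_folders(rows, bams)
print(assigned)
print(unmatched)
```

Returning the unmatched files separately makes it easier to see where folder counts and histology counts start to diverge.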
The sample file counts for each folder are:
ALL - 299 bam files
Osteosarcoma - 82 bam files
Wilms Tumor - 46 bam files
These totals differ from counts for WXS samples in both the v6 histologies file and the manifest attached to this ticket:
Osteosarcoma:
target_t_n_matches.txt - 86
v6_histologies.tsv - 90
Wilms Tumor:
target_t_n_matches.txt - 43
v6_histologies.tsv - 50
ALL:
target_t_n_matches.txt - 289
v6_histologies.tsv - 308
Additionally, there is a discrepancy with tumor normal pairs for the samples in each disease type. Here are the tumor and normal counts for each disease type in the Cavatica project:
Wilms Tumor:
Tumor - 90
Normal - 2
ALL:
Tumor - 598
Normal - 0
Osteosarcoma:
Tumor - 164
Normal - 0
Also note that the ALL samples include primary blood derived cancer samples from both bone marrow and peripheral blood and, lastly, recurrent blood derived cancer - bone marrow samples.
At this point I am trying to figure out whether these files are correct to start the analysis, or whether I have missed something and need to re-import or re-sort them.
Here is a link to the project that has the files under the different folders for disease type:
https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-target-alignment-wxs-tumor/files/#q
Hi @bmennis - I do not have access to that project, but what do you mean by:
ALL: target_t_n_matches.txt - 289 v6_histologies.tsv - 308
and then:
ALL Tumor - 598 Normal - 0
I did a quick check on the v6 and v7 histologies for ALL, and found 289 normals (v6/7) which correspond to 572 (v6) and 564 (v7) tumor samples. Attached are those 289 normal IDs: normal_acute_lymph_leuk_bs_ids_v7.csv. The v7 histologies file can be found here: s3://kf-openaccess-us-east-1-prd-pbta/open-targets/v7/histologies.tsv
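library(tidyverse)  # provides %>%, filter(), pull(), select(), write_csv()
# `v7` is assumed to be the v7 histologies.tsv already loaded as a data frame
# Participants with an ALL diagnosis
all_pts <- v7 %>%
  filter(pathology_diagnosis == "Acute Lymphoblastic Leukemia") %>%
  pull(Kids_First_Participant_ID) %>%
  unique()
# Normal samples belonging to those participants, written out as a CSV
normal_all <- v7 %>%
  filter(sample_type == "Normal" & Kids_First_Participant_ID %in% all_pts) %>%
  select(Kids_First_Biospecimen_ID, Kids_First_Participant_ID, sample_type, aliquot_id, composition) %>%
  write_csv("normal_acute_lymph_leuk_bs_ids_v7.csv")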
Does this help? Do you know if there are FASTQ files associated with WXS as well as BAMs, which may make up the remaining missing samples?
possible UI issue linking GDC to SBG - @yuankunzhu and @zhangb1 to check on this
Example of osteosarcoma WXS normals: TARGET-40-0A4HLD-10A-01D TARGET-40-0A4HMC-10A-01D TARGET-40-0A4HX8-10A-01D TARGET-40-0A4HXS-10A-01D TARGET-40-0A4HY5-10A-01D TARGET-40-0A4I0Q-10A-01D TARGET-40-0A4I0W-10A-01D TARGET-40-0A4I3S-10A-01D TARGET-40-0A4I4E-10A-01D TARGET-40-0A4I4M-10A-01D TARGET-40-0A4I4O-10A-01D TARGET-40-0A4I5B-10A-01D TARGET-40-0A4I6O-10A-01D TARGET-40-0A4I8U-10A-01D TARGET-40-0A4I9K-10A-01D TARGET-40-0A4I42-10A-01D TARGET-40-0A4I48-10A-01D TARGET-40-0A4I65-10A-01D TARGET-40-PAKFVX-10A-01D TARGET-40-PAKUZU-10A-01D TARGET-40-PAKXLD-10A-01D TARGET-40-PAKZZK-10A-01D TARGET-40-PALECC-10A-01D TARGET-40-PALFYN-10A-01D TARGET-40-PALHRL-10A-01D TARGET-40-PALKDP-10A-01D TARGET-40-PALKGN-10A-01D TARGET-40-PALWWX-10A-01D TARGET-40-PALZGU-10A-01D TARGET-40-PAMEKS-10A-01D TARGET-40-PAMHLF-10A-01D TARGET-40-PAMHYN-10A-01D TARGET-40-PAMJXS-10A-01D TARGET-40-PAMLKS-10A-01D TARGET-40-PAMRHD-10A-01Y TARGET-40-PAMTCM-10A-01Y TARGET-40-PAMYYJ-10A-01D TARGET-40-PANGPE-10A-01D TARGET-40-PANGRW-10A-01Y TARGET-40-PANMIG-10A-01Y TARGET-40-PANPUM-10A-01D TARGET-40-PANSEN-10A-01D TARGET-40-PANVJJ-10A-01D TARGET-40-PANXSC-10A-01D TARGET-40-PANZHX-10A-01Y TARGET-40-PANZZJ-10A-01Y TARGET-40-PAPFLB-10A-01D TARGET-40-PAPIJR-10A-01D TARGET-40-PAPKWD-10A-01Y TARGET-40-PAPNVD-10A-01D TARGET-40-PAPVYW-10A-01Y TARGET-40-PAPWWC-10A-01Y TARGET-40-PAPXGT-10A-01D TARGET-40-PARBGW-10A-01D TARGET-40-PARDAX-10A-01D TARGET-40-PARFTG-10A-01D TARGET-40-PARGTM-10A-01D TARGET-40-PARJXU-10A-01D TARGET-40-PARKAF-10A-01D TARGET-40-PASEBY-10A-01D TARGET-40-PASEFS-10A-01D TARGET-40-PASFCV-10A-01D TARGET-40-PASKZZ-10A-01D TARGET-40-PASNZV-10A-01D TARGET-40-PASRNE-10A-01Y TARGET-40-PASSLM-10A-01D TARGET-40-PASUUH-10A-01Y TARGET-40-PASYUK-10A-01Y TARGET-40-PATAWV-10A-01Y TARGET-40-PATEEM-10A-01Y TARGET-40-PATJVI-10A-01Y TARGET-40-PATKSS-10A-01D TARGET-40-PATMIF-10A-01D TARGET-40-PATMPU-10A-01D TARGET-40-PATMXR-10A-01D TARGET-40-PATPBS-10A-01D TARGET-40-PATUXZ-10A-01D TARGET-40-PATXFN-10A-01D 
TARGET-40-PAUBIT-10A-01Y TARGET-40-PAUTWB-10A-01D TARGET-40-PAUTYB-10A-01D TARGET-40-PAUUML-10A-01D TARGET-40-PAUVUL-10A-01D TARGET-40-PAUXPZ-10A-01D TARGET-40-PAUYTT-10A-01D TARGET-40-PAVALD-10A-01D TARGET-40-PAVCLP-10A-01D TARGET-40-PAVDTY-10A-01D TARGET-40-PAVECB-10A-01D
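For sorting aliquot IDs like the ones listed above into tumor and normal, a small barcode parser can help. The sketch below assumes the TCGA-style convention in which the two-digit code after the case ID marks the sample type and codes 10 through 19 denote normals. That is consistent with the `-10A-` normals and `-01A-` tumor seen in this thread, but the code range itself is an assumption. The `Aliquot` type and `parse_barcode` helper are hypothetical names.

```python
import re
from typing import NamedTuple

# Shape inferred from IDs in this thread, e.g. TARGET-40-PAKFVX-10A-01D.
BARCODE_RE = re.compile(
    r"^(?P<case_id>TARGET-\d{2}-[A-Z0-9]{6})"
    r"-(?P<sample_code>\d{2})(?P<vial>[A-Z])"
    r"-(?P<suffix>\d{2}[A-Z])$"
)


class Aliquot(NamedTuple):
    case_id: str      # e.g. TARGET-50-PAJNNR
    sample_code: str  # two-digit sample type code, e.g. "10"
    vial: str         # letter following the sample code
    suffix: str       # trailing block, e.g. "01D"

    @property
    def is_normal(self) -> bool:
        # Assumption: codes 10-19 are normals (TCGA-style convention).
        return 10 <= int(self.sample_code) <= 19


def parse_barcode(barcode: str) -> Aliquot:
    """Split an aliquot barcode into its parts, or raise ValueError."""
    match = BARCODE_RE.match(barcode.strip())
    if match is None:
        raise ValueError(f"unrecognised aliquot barcode: {barcode!r}")
    return Aliquot(**match.groupdict())


normal = parse_barcode("TARGET-50-PAJNNR-10A-01D")
tumor = parse_barcode("TARGET-50-PAKGZX-01A-01D")
print(normal.case_id, normal.is_normal)
print(tumor.case_id, tumor.is_normal)
```

Grouping parsed aliquots by `case_id` gives a quick, file-only view of which cases have both a tumor and a normal, independent of the histologies file.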
Ok I made up some lists of files in the WXS tumor alignment project for the three different disease types. I used the v6 histologies file to search for the tumor and any matching normal samples in the project. It looks like for both Osteosarcoma and ALL that there are some normal samples that either do not have a tumor sample in the histologies file or we do not have the file. For Wilms tumor there are some normal samples missing tumors like the other disease types along with two tumor samples that do not have a normal sample in the histology file.
I have the files attached here: ALL_histology_sampled_pairs.txt Osteosarcoma_histology_sampled_pairs.txt Wilms_tumor_histology_sampled_pairs.txt
I noticed a bug in the script I used to generate the previous lists: samples that were missing tumor or normal files were sometimes excluded. I wanted these files to list all tumor-normal pairs regardless of whether the files are present in the Cavatica project, so that we know which files we have and which we don't. I believe I have corrected the lists and have them below now:
ALL_histology_sampled_pairs.txt Osteosarcoma_histology_sampled_pairs.txt Wilms_tumor_histology_sampled_pairs.txt
I'll also upload the script I used to generate these files here: get_tum_norm_pairs_histologies.txt
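To illustrate the intended behaviour (every tumor-normal pair is listed, with a missing file shown as empty rather than the row being dropped), here is a minimal sketch. It is not the attached script. The `build_pairs` helper, the `Tumor_ID`/`Norm_ID` keys, and the example data are hypothetical. `Tumor_file_name` and `Norm_file_name` mirror the column names used in the pair lists in this thread.

```python
from collections import defaultdict


def build_pairs(histology_rows, file_by_aliquot):
    """List every tumor/normal combination per participant.

    A sample with no partner in the histologies rows is kept and paired
    with None; a sample with no file in the project gets a None file name.
    """
    tumors = defaultdict(list)
    normals = defaultdict(list)
    for row in histology_rows:
        participant = row["Kids_First_Participant_ID"]
        if row["sample_type"] == "Tumor":
            tumors[participant].append(row["Kids_First_Biospecimen_ID"])
        elif row["sample_type"] == "Normal":
            normals[participant].append(row["Kids_First_Biospecimen_ID"])

    pairs = []
    for participant in sorted(set(tumors) | set(normals)):
        # [None] is a placeholder so unpaired samples still produce a row.
        for tumor_id in tumors.get(participant) or [None]:
            for normal_id in normals.get(participant) or [None]:
                pairs.append({
                    "Kids_First_Participant_ID": participant,
                    "Tumor_ID": tumor_id,
                    "Tumor_file_name": file_by_aliquot.get(tumor_id),
                    "Norm_ID": normal_id,
                    "Norm_file_name": file_by_aliquot.get(normal_id),
                })
    return pairs


# Made-up example: P1 has two tumors sharing one normal, P2 has no normal,
# P3 has no tumor, and tumor T1b has no file in the project.
rows = [
    {"Kids_First_Participant_ID": "P1", "Kids_First_Biospecimen_ID": "T1", "sample_type": "Tumor"},
    {"Kids_First_Participant_ID": "P1", "Kids_First_Biospecimen_ID": "T1b", "sample_type": "Tumor"},
    {"Kids_First_Participant_ID": "P1", "Kids_First_Biospecimen_ID": "N1", "sample_type": "Normal"},
    {"Kids_First_Participant_ID": "P2", "Kids_First_Biospecimen_ID": "T2", "sample_type": "Tumor"},
    {"Kids_First_Participant_ID": "P3", "Kids_First_Biospecimen_ID": "N3", "sample_type": "Normal"},
]
files = {"T1": "t1.bam", "N1": "n1.bam", "T2": "t2.bam", "N3": "n3.bam"}

pairs = build_pairs(rows, files)
complete = [p for p in pairs if p["Tumor_file_name"] and p["Norm_file_name"]]
print(len(pairs), len(complete))
```

Filtering for rows where both file names are present then gives the pairs that can actually go into a tumor-normal workflow.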
Thanks @bmennis - are you saying that our totals for these pairs of WXS samples overlapping between histologies.tsv and GDC are as below:
> all %>%
+ filter(!is.na(Tumor_file_name) & !is.na(Norm_file_name)) %>%
+ nrow()
[1] 291
>
> os %>%
+ filter(!is.na(Tumor_file_name) & !is.na(Norm_file_name)) %>%
+ nrow()
[1] 77
>
> wilms %>%
+ filter(!is.na(Tumor_file_name) & !is.na(Norm_file_name)) %>%
+ nrow()
[1] 45
From the original table:
> v7 %>%
+ filter(sample_type == "Tumor" & experimental_strategy == "WXS" & cohort == "TARGET") %>%
+ select(cancer_group, experimental_strategy) %>%
+ table()
                              experimental_strategy
cancer_group                   WXS
  Acute Lymphoblastic Leukemia 308
  Acute Myeloid Leukemia        31
  Neuroblastoma                222
  Osteosarcoma                  90
  Wilms tumor                   50
That seems pretty good to me! cc @chinwallaa
I just wanted to update that the alignment of the tumor WXS samples is complete. I am patching metadata to the outputs, which is taking some time due to the large number of samples and outputs.
I am working on getting the normal samples to run now and will update with developments.
I am finishing up the WXS normal sample alignment harmonizations. All samples completed successfully except one that I am not able to harmonize. The aliquot ID for the normal sample is `TARGET-50-PAJNNR-10A-01D` and the case ID is `TARGET-50-PAJNNR`. The issue seems to be that I do not have access to the file: when I try to download it there is an error from Cavatica, and when I try to run the alignment workflow on it, it fails saying access is denied. This does not make sense, as I have access to the other samples, so I am not entirely sure how to fix it.
I will patch metadata to the other alignment outputs in the meantime and see if there is something I can do about this.
@jharenza could you look at that sample? ^ The file is `47f76f61-5d88-504d-95de-e779355042ed_wxs_gdc_realn.bam`. I tried to re-import it and it still failed. Not sure what the issue is.
@afarrel does anyone have credentials to download this from GDC directly https://portal.gdc.cancer.gov/files/faaf9e26-05c5-486c-aec4-11e1fa45f8df instead of via the Cavatica link ?
It looks like I can download it from the GDC. It will take some time, but afterwards I can upload it to Cavatica.
The last normal WXS sample has finished aligning; I used the sample downloaded from the GDC. There appears to be an issue with Cavatica in which the samples cannot be accessed despite Cavatica and CGC showing that TARGET access is granted.
The aligned reads for the normal and tumor samples, along with the raw gVCF files for the normals, have all been copied to the delivery project. Here are links to both of those folders for the outputs: https://cavatica.sbgenomics.com/u/d3b-bixu/open-target/files/#q?path=harmonized-data%2Faligned-reads&page=1 https://cavatica.sbgenomics.com/u/d3b-bixu/open-target/files/#q?path=harmonized-data%2Fraw-gvcf
I am going to work on the somatic analysis for the paired samples now and will update with developments. I will see if the access issue can be fixed so that I may run analysis.
After checking with NGSCheckMate, here are the total pairs for each disease type. We will run the somatic workflow on those pairs.
Acute Lymphoblastic Leukemia 291
Osteosarcoma 77
Wilms tumor 45
@bmennis can you provide a final list of samples which you have run with this iteration for @ewafula
@jharenza Brian has a laptop issue, so I have posted below the tumor and normal aliquot IDs that were processed for this part 2.
Tumor: 430 Normal: 400
The pairs information is below; somatic analysis is still in process. Total of 413 pairs: tumor_normal_pairs.txt
@ewafula ^
Thanks, @zhangb1 !
On Thu, Aug 12, 2021 at 9:31 AM Bo Zhang @.***> wrote:
@jharenza Brian has a laptop issue, so I have posted below the tumor and normal aliquot IDs that were processed for this part 2.
Tumor: 430 Normal: 400
run_normal.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6975790/run_normal.txt run_tumor.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6975793/run_tumor.txt
The pairs information is below, a total of 413 pairs: tumor_normal_pairs.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6975801/tumor_normal_pairs.txt
@ewafula ^
@zhangb1, just for the record. Only 3 cancer types are represented in the TARGET WXS data. You mention Part2, is there another list for Part1?

| TARGET Cancer Type | Normal | Tumor |
|---|---|---|
| Acute lymphoblastic leukemia | 281 | 303 |
| Osteosarcoma | 74 | 82 |
| Wilms tumor | 45 | 45 |
| Acute myeloid leukemia | 0 | 0 |
| Neuroblastoma | 0 | 0 |
| Rhabdoid tumor | 0 | 0 |
| Clear cell sarcoma of the kidney | 0 | 0 |
cc @jharenza
Part 1 was released with v7, AML + NBL. We will do a part 3, which will be remaining ALL, so you should have what you need now.
Ok, thanks!
The somatic analysis is done, but only 412 pairs produced results. One pair failed: tumor `TARGET-50-PAKGZX-01A-01D`, normal `TARGET-50-PAKGZX-10A-01D`.
Judging by the error and the alignment metrics, the tumor sample `TARGET-50-PAKGZX-01A-01D` does NOT appear to be a paired-end BAM file.
We may need to skip this pair for the v8 release. @jharenza @afarrel
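One way to sanity-check whether a BAM is paired-end is to look at the SAM FLAG field, where bit `0x1` marks a read as coming from a template with multiple segments. The sketch below works on SAM text lines, such as those printed by `samtools view`. It is not how the failure above was diagnosed; the `paired_fraction` helper and the example records are made up for illustration.

```python
PAIRED_FLAG = 0x1  # SAM spec: template has multiple segments in sequencing


def paired_fraction(sam_lines):
    """Return the fraction of alignment records with the paired bit set.

    Header lines (starting with '@') and blank lines are skipped.
    Returns 0.0 when there are no alignment records.
    """
    total = 0
    paired = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        total += 1
        if flag & PAIRED_FLAG:
            paired += 1
    return paired / total if total else 0.0


# Made-up records: two paired reads (flags 99 and 147) and two unpaired (0, 16).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    "r1\t147\tchr1\t300\t60\t4M\t=\t100\t-204\tACGT\tFFFF",
    "r2\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tFFFF",
    "r3\t16\tchr1\t700\t60\t4M\t*\t0\t0\tACGT\tFFFF",
]
fraction = paired_fraction(example)
print(fraction)
```

A fraction at or near zero over a sample of reads would suggest a single-end BAM. With samtools available, `samtools view -c -f 1 sample.bam` should give the count of paired reads directly.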
Thanks @zhangb1 also cc @ewafula to exclude from the v8 histology file
@jharenza, I will exclude this tumor-normal pair and update the PR. Question: for the v8 update that filters out WXS samples not in @zhangb1's final list of read files from GDC, was I supposed to keep all samples on his list, or only the samples on his list that have a tumor-normal pair?
Hi @ewafula , is there a ticket for all the histology file update for v8? @zhangb1 noticed that GMKF WGS were missing germline_sex_estimate which he has updated now and we would like that to be updated in v8 histologies.tsv file release as well.
@kgaonkar6, there is no one particular ticket for v8 updates. They are spread over multiple tickets with other issues. Please open one and include a mapping file of bs_ids to `germline_sex_estimate` to update v8/histologies.tsv. I have made all required updates so far and will update once you open a ticket.
We only have tumor normal methods so far- it may be the case that there is a relapse tumor or primary tumor on his list which we don't have access to but we still have the normal and other tumor, so I think whatever is on the list would be excluded? Cc @zhangb1 to confirm
@jharenza @ewafula
We may keep the other unpaired samples too, since we did the alignment analysis for those as well, just for the record. We may get the paired tumor or normal sample in the future?
@zhangb1 I guess the list we want is the list of T/N pairs on which we have somatic calls, including cases such as N1/T1 and N1/T2. We will keep all info in a separate file, but only those with somatic data will go in the v8 histologies file, because it will affect the independent sample selection module. Could you provide that list?
@jharenza @ewafula
After removing the one that has an error, the total is 412 pairs. Some tumors use the same normal.
Perfect, thank you
Thanks @zhangb1, I had already removed the problematic pair and updated the v8 histologies file.
On Mon, Aug 16, 2021 at 10:19 AM Jo Lynne Rokita @.***> wrote:
@jharenza @ewafula
After removing the one that has an error, the total is 412 pairs. Some tumors use the same normal.
412_pairs_task_info.txt https://github.com/PediatricOpenTargets/ticket-tracker/files/6992958/412_pairs_task_info.txt
Perfect, thank you
This section is required for PM/ADAPT team
[Required] For what study (study_id) is this request?
Alignment open-target-target-alignment-wxs-normal open-target-target-alignment-wxs-tumor
Somatic analysis open-target-target-somatic-mutations-wxs-tumor
[Required] Basic study information
NOTE: Study short name is required for D3B study, but optional for Kids First X01; Fiscal year is required for Kids First X01, but optional for D3B study. This information will be used for project naming in Cavatica.
[Required] What workflow needs to be done?
For reference: https://www.notion.so/d3b/workflow-type-365a604413534b2b9d93c969557d60a2
[Required] Requested sample manifest.
v6 histologies.tsv
s3://kf-openaccess-us-east-1-prd-pbta/open-targets/v6/histologies.tsv
For somatic analysis I've matched by Kids_First_Participant_ID with the following code and generated this list : target_t_n_matches.txt
Acute Lymphoblastic Leukemia, Osteosarcoma and Wilms tumor as part2. But could you verify the counts for AML and NB if it is not too much work.
[Required] Billing group for this study?
If there's no billing group setup for this study or project yet, please follow the D3b Billing Workflow guidelines and submit your request at D3b Billing Group Creation board
[Optional] List bucket names for source files.
[Optional] What is the priority for the analysis? (Low, Medium, High)
[Optional] How long do you think this work will take?
[Optional] Who will complete this work?
This section is ONLY REQUIRED for Bixops manager/operations
[Required] Provide Cavatica project link.
[Required] Provide details for analysis workflow(s).