Updated analysis: pedcbio-sample-name

d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

Updated analysis: pedcbio-sample-name #532

Closed. migbro closed this issue 1 year ago

migbro commented 1 year ago

What analysis module should be updated and why?

pedcbio-sample-name. It has a bug.

What changes need to be made? Please provide enough detail for another participant to make the update.

At the end:

histology_all_fixed %>%
  dplyr::filter(cohort != "TCGA") %>%
  readr::write_tsv(file.path(results_dir, "histologies-formatted-id-added.tsv"))

This causes TCGA samples to be dropped. That used to be the desired behavior, but it no longer is. Simply removing dplyr::filter(cohort != "TCGA") %>% should do the trick. Also, some entries are repeated for some reason; I need to investigate why.
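To make the effect of that filter concrete, here is a minimal sketch on a toy table. It is written in Python with the standard library only, not the module's R code, and the participant IDs and the `write_tsv` helper are made up for illustration. Only the `cohort` and `Kids_First_Participant_ID` column names come from this thread.

```python
import csv
import io

# Toy stand-in for the histologies table; the participant IDs are made up.
ROWS = [
    {"Kids_First_Participant_ID": "PT_0001", "cohort": "PBTA"},
    {"Kids_First_Participant_ID": "PT_0002", "cohort": "TCGA"},
    {"Kids_First_Participant_ID": "PT_0003", "cohort": "GMKF"},
]


def write_tsv(rows, drop_tcga=False):
    """Serialize rows as TSV text; drop_tcga=True mimics the old filter."""
    if drop_tcga:
        rows = [row for row in rows if row["cohort"] != "TCGA"]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Kids_First_Participant_ID", "cohort"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


old_output = write_tsv(ROWS, drop_tcga=True)  # behavior with the filter
new_output = write_tsv(ROWS)                  # behavior once it is removed
```

With the filter in place, the TCGA row never reaches the output. Without it, every cohort is written.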

What input data should be used? Which data were used in the version being updated?

histologies pre-v12 release

When do you expect the revised analysis will be completed?

Next week maybe?

Who will complete the updated analysis?

@migbro

migbro commented 1 year ago

Quick update - repeat entries happened because old files existed in input dir causing dupes to occur

jharenza commented 1 year ago

@kelseykeith would you be able to look into this and work with Miguel?

kelseykeith commented 1 year ago

Sure, I can work on this.

migbro commented 1 year ago

FYI, I have a branch here: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/bug/mb-fix-pedcbio-sample where the cohort issue is easily fixed. It might be nice, though, to make this script less dependent on being inside a git repo, etc., as it seems like overkill to load a 10+GB repo to process a 16MB file.
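One way a standalone version could look, sketched in Python as an illustration only (the actual module is R and its argument handling is not shown in this thread): take the input table and output directory as explicit arguments, so nothing depends on locating a repository root. The flag names, the `input/histologies.tsv` path, and the `results` default are all assumptions. Only the output file name comes from the snippet in the issue.

```python
import argparse
from pathlib import Path

# Output file name taken from the snippet in the issue description.
OUTPUT_NAME = "histologies-formatted-id-added.tsv"


def parse_args(argv=None):
    """Parse explicit paths so the script does not rely on a repo root."""
    parser = argparse.ArgumentParser(
        description="Add formatted sample IDs to a histologies table (sketch)."
    )
    # Both flags are hypothetical; the real script may be wired differently.
    parser.add_argument("--histologies", type=Path, required=True,
                        help="path to the histologies table to process")
    parser.add_argument("--results-dir", type=Path, default=Path("results"),
                        help="directory to write the formatted table into")
    return parser.parse_args(argv)


def output_path(args):
    """Build the output location from the parsed arguments."""
    return args.results_dir / OUTPUT_NAME


# Example invocation with a made-up input path.
args = parse_args(["--histologies", "input/histologies.tsv"])
out_file = output_path(args)
```

The trade-off is that inputs and outputs are no longer tracked by the repository layout, so they would need to be recorded some other way.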

kelseykeith commented 1 year ago

@migbro can you be more specific about what's duplicated? Because I don't see any issues with duplication. The output is the histologies table with an additional formatted_sample_id column added on; some of those ids are in the table multiple times but as far as I can see that's expected behavior because each id is unique for each Kids_First_Participant_ID. It was the same in the previous v11 version of the table.
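A quick way to tell expected repeats from a real problem, sketched in Python on made-up rows: this assumes "unique for each Kids_First_Participant_ID" means a given formatted_sample_id should never span more than one participant. That reading is an assumption, and the helper name is hypothetical.

```python
from collections import defaultdict

# Made-up rows; only the two column names come from this thread.
rows = [
    {"Kids_First_Participant_ID": "PT_A", "formatted_sample_id": "ID_1"},
    {"Kids_First_Participant_ID": "PT_A", "formatted_sample_id": "ID_1"},
    {"Kids_First_Participant_ID": "PT_B", "formatted_sample_id": "ID_2"},
]


def ids_shared_across_participants(rows):
    """Return formatted IDs that map to more than one participant."""
    participants_by_id = defaultdict(set)
    for row in rows:
        participants_by_id[row["formatted_sample_id"]].add(
            row["Kids_First_Participant_ID"]
        )
    return {
        sample_id: sorted(participants)
        for sample_id, participants in participants_by_id.items()
        if len(participants) > 1
    }


# ID_1 repeats, but only within PT_A, so nothing is flagged.
conflicts = ids_shared_across_participants(rows)
```

An ID that merely appears on several rows for the same participant is not reported. Only an ID attached to two different participants is.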

migbro commented 1 year ago

@kelseykeith see this comment:

Quick update - repeat entries happened because old files existed in input dir causing dupes to occur

Basically, I didn't realize that the input dir already had CSV files in it. I had created a new one that overlapped with them, so the script read in every CSV and re-processed some of the IDs. Once I cleared out the dir and used only the table of interest, the issue resolved itself. I also added a fix to handle DGD sample naming when it's not already in the predefined file.

I think the best thing that could happen to this script is to let it run standalone, without having to be inside a git repo, especially this one. It's pretty lightweight, and it seems pretty aggressive to grab the whole repo to process a few megs of data. However, that's more of a quality-of-life issue and a matter of opinion, as it may be desirable for this to follow the usual framework so that new inputs and new outputs are tracked.
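The overlap described above can be reproduced with a small Python sketch (illustrative only; the file names, the `sample_id` column, and both helpers are made up, and the real script is R): reading every CSV in a directory picks up stale files, while naming the one intended file does not.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def read_ids(paths, id_column="sample_id"):
    """Collect the ID column from each CSV file, in the order given."""
    ids = []
    for path in paths:
        with open(path, newline="") as handle:
            ids.extend(row[id_column] for row in csv.DictReader(handle))
    return ids


def repeated(ids):
    """Return the IDs that occur more than once, sorted."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)


with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp)
    # A stale file and a new file that overlap on S2 (all names made up).
    (input_dir / "old_export.csv").write_text("sample_id\nS1\nS2\n")
    (input_dir / "new_export.csv").write_text("sample_id\nS2\nS3\n")

    # Reading the whole directory pulls in the stale file as well.
    everything = read_ids(sorted(input_dir.glob("*.csv")))
    # Naming the intended file avoids the overlap.
    only_new = read_ids([input_dir / "new_export.csv"])

dupes_from_glob = repeated(everything)
dupes_from_single = repeated(only_new)
```

A check like `repeated` run right after loading would surface this kind of stale-input problem before any IDs are re-processed.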

kelseykeith commented 1 year ago

OK, that makes sense now. I had branched off of what you already did.

Are there other issues you need help with, or is this pretty much resolved?

migbro commented 1 year ago

I think it's resolved, thanks!

migbro commented 1 year ago

PR merged, closing ticket!