migbro closed this issue 1 year ago
Quick update: the repeat entries happened because old files already existed in the input dir, which caused the dupes.
@kelseykeith would you be able to look into this / work with Miguel?
Sure, I can work on this.
FYI, I have a branch here: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/bug/mb-fix-pedcbio-sample where the cohort issue is easily fixed. It might be nice, though, to make this script less dependent on being run within a git repo, etc., as it seems like overkill to load a 10+GB repo to process a 16MB file.
@migbro can you be more specific about what's duplicated? I don't see any issues with duplication. The output is the histologies table with an additional formatted_sample_id column added on; some of those IDs are in the table multiple times, but as far as I can see that's expected behavior because each ID is unique for each Kids_First_Participant_ID. It was the same in the previous v11 version of the table.
@kelseykeith see this comment:
Quick update - repeat entries happened because old files existed in input dir causing dupes to occur
Basically, I didn't realize that the input dir already had csv files in it. I created a new one that overlapped with them, so the script read in every csv and re-processed some of the IDs. Once I cleared out the dir and used only the table of interest, the issue resolved itself. I also added a fix to handle DGD sample naming when it's not already in the predefined file. I think the best thing that could happen to this script is to let it run standalone, without having to be inside a git repo, especially this one. It's pretty lightweight, and it seems pretty aggressive to grab the whole repo to process a few megs of data. That's more of a quality-of-life issue and an opinion, though, as it may be desirable for this to follow the usual framework so that new inputs and new outputs are tracked.
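For anyone hitting the same thing, here is a minimal sketch of a pre-flight check that lists IDs appearing in more than one csv in a directory. It is not part of the pedcbio-sample-name module: the function name `find_cross_file_dupes`, the `sample_id` column, and the demo files are all hypothetical, and it assumes every csv in the dir has the ID column.

```r
# Hypothetical helper (not from the repo): report IDs that occur in more than
# one CSV file in input_dir. Assumes every CSV has a column named id_col.
find_cross_file_dupes <- function(input_dir, id_col) {
  csv_files <- list.files(input_dir, pattern = "\\.csv$", full.names = TRUE)
  # Collect the distinct IDs of each file, so repeats within one file are ignored
  ids_per_file <- lapply(csv_files, function(path) {
    df <- read.csv(path, stringsAsFactors = FALSE)
    stopifnot(id_col %in% names(df))
    unique(as.character(df[[id_col]]))
  })
  all_ids <- as.character(unlist(ids_per_file))
  # Any ID still duplicated after per-file de-duplication spans several files
  sort(unique(all_ids[duplicated(all_ids)]))
}

# Demo with two overlapping files in a throwaway directory
demo_dir <- tempfile("input_demo")
dir.create(demo_dir)
write.csv(data.frame(sample_id = c("S1", "S2")),
          file.path(demo_dir, "old.csv"), row.names = FALSE)
write.csv(data.frame(sample_id = c("S2", "S3")),
          file.path(demo_dir, "new.csv"), row.names = FALSE)
dupes <- find_cross_file_dupes(demo_dir, "sample_id")
print(dupes)
```

If the result is non-empty, a leftover file is probably sitting in the dir, and it is safer to stop than to process everything.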
Ok, makes sense now, I had branched off of what you already did
Are there other issues you need help with, or is this pretty much resolved?
I think it's resolved, thanks!
PR merged, closing ticket!
What analysis module should be updated and why?
pedcbio-sample-name. It has a bug.
What changes need to be made? Please provide enough detail for another participant to make the update.
At the end:
This causes TCGA samples to be dropped. That used to be the desired behavior, but it no longer is. Simply removing
dplyr::filter(cohort != "TCGA") %>%
should do the trick. Also, some entries are repeated for some reason; need to investigate why.
What input data should be used? Which data were used in the version being updated?
histologies, pre-v12 release
When do you expect the revised analysis will be completed?
Next week maybe?
Who will complete the updated analysis?
@migbro