covingto / pancanmafmerge

A repo for code specifically tailored to the merger of the TCGA PanCan VCF (per-caller) files into a MAF file. This repository borrows from code available in other repositories and has some custom code to handle the merger. The project's main purpose is simply to do the merge for this one project and therefore has no features to make it more generic or maintainable beyond this goal.

better glob pattern for run-batch.py #3

Closed covingto closed 8 years ago

covingto commented 8 years ago

The glob patterns in run-batch.py (line 46, in the dispatch function) don't cover all cases. Some studies seem to include two files for each caller. The glob pattern is likely too simple and may need to be replaced with more logic to isolate the correct / best file to use. The dispatch function must generate two lists to be submitted to merge.py: one of VCF files and one of caller tags.
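As a rough illustration of the "more logic" idea, here is a minimal sketch that classifies file names by caller instead of relying on a single glob, and refuses to guess when a caller has more than one file. The tag names, the `CALLER_PATTERNS` table, and the `dispatch_lists` function are all hypothetical; this is not the actual code in run-batch.py, and the argument format merge.py expects is not shown here.

```python
import re
from collections import defaultdict

# Hypothetical caller tags and file-name patterns; adjust to the real naming scheme.
CALLER_PATTERNS = {
    "muse": re.compile(r"\.muse\."),
    "mutect": re.compile(r"\.mutect\."),
    "pindel": re.compile(r"\.pindel\."),
    "somaticsniper": re.compile(r"\.somaticsniper\.", re.IGNORECASE),
    "varscan_indel": re.compile(r"\.varscan\.indel\."),
    "varscan_snp": re.compile(r"\.varscan\.snp\."),
    "radia": re.compile(r"\.radia\."),
}


def classify(filenames):
    """Group VCF file names by the first caller pattern they match."""
    by_caller = defaultdict(list)
    for name in filenames:
        for tag, pattern in CALLER_PATTERNS.items():
            if pattern.search(name):
                by_caller[tag].append(name)
                break
    return dict(by_caller)


def dispatch_lists(filenames):
    """Return (vcf_files, caller_tags); raise if any caller has several files."""
    by_caller = classify(filenames)
    ambiguous = sorted(tag for tag, files in by_caller.items() if len(files) > 1)
    if ambiguous:
        raise ValueError("more than one VCF for callers: %s" % ", ".join(ambiguous))
    tags = sorted(by_caller)
    return [by_caller[tag][0] for tag in tags], tags
```

Raising on ambiguity, rather than silently picking one file, keeps the choice of "best" file an explicit decision.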

kellrott commented 8 years ago

Do you have some examples of samples that this happens on?

covingto commented 8 years ago

This is a CHOL directory:

ll data/CHOL/0775583e-c0a0-4f18-9ca2-8f89cedce3d6_01_11/
total 59712
-rw-rw-r-- 1 covingto can 196106 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.muse.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 54737791 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.mutect.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 134585 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.pindel.somatic.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 149918 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.SomaticSniper.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 331483 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.varscan.indel.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 2103318 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.varscan.snp.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 79550 Mar 14 22:25 filtered.radia.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 48915 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.muse.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 135143 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.pindel.somatic.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 96189 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.SomaticSniper.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 336028 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.varscan.indel.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 1796255 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.varscan.snp.annotated.tcga_filtered.reheadered.vcf

There is only one mutect file and one radia file, but the other callers have two files each. I guess I could take the newer one, but what do you think?

singerma commented 8 years ago

The duplicated studies actually have different files. If you take a look at the freeze lists, the initial freeze list and the second freeze list both contain this patient and these samples. However, the files are different. It appears that the data was in the initial freeze, and was then co-cleaned and put into the second freeze list as well.

One problem here is that certain files will end up overwriting each other. For example, the radia files have the same name, so whichever file was downloaded last when I did this will have overwritten the one downloaded first. I think the right way to do this is to amend the freeze lists: find all samples that are in both the first and second lists, remove them from the first list, and regenerate the pile of VCFs.
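The amendment described above could be sketched roughly as follows. The freeze list format is not given in this thread, so this assumes each list has already been parsed into rows whose first field is a sample identifier; `amend_first_freeze` and the `key` argument are hypothetical names used only for illustration.

```python
def amend_first_freeze(first_rows, second_rows, key=lambda row: row[0]):
    """Split the first freeze list into rows to keep and rows to drop.

    A row is dropped when its sample identifier also appears in the
    second freeze list. `key` extracts the identifier from a row; the
    default (first field) is an assumption about the list layout.
    """
    second_ids = {key(row) for row in second_rows}
    kept = [row for row in first_rows if key(row) not in second_ids]
    removed = [row for row in first_rows if key(row) in second_ids]
    return kept, removed
```

Returning the removed rows alongside the kept ones makes it easy to circulate the exact lines being dropped for review before the VCFs are regenerated.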

covingto commented 8 years ago

OK that sounds reasonable. Right now I'm skipping these for MAF merger. Is this something that you can do or do we need to get the rest of the group involved to get you the freeze lists?

Thanks Kyle C


singerma commented 8 years ago

I have the freeze lists, I will figure out what lines to remove and email the larger group to make sure that those files are okay to be removed.

covingto commented 8 years ago

Dropped in favor of driving the merger with specific manifest files, so there is no real need for a glob pattern.