dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

Genotype column name #223

Open jimhavrilla opened 4 years ago

jimhavrilla commented 4 years ago

Hey guys,

Probably a stupid question, but is there a way (or a simple-to-add feature) to choose the names of the genotype columns when creating a merged pVCF? For example, I am trying to create a pVCF from UK Biobank data. There are ~50,000 gVCFs, and I need to use the filenames of the .gz files as the sample names, because they carry an encrypted ID that is specific to my project. Your command extracts the IDs from inside each file, which I am guessing is the correct behavior in most cases, but in my case those IDs are wrong: they are specific to whoever made the gVCFs in the first place.

e.g. for file NEWNUMBER_23176_0_0.gz

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  UKB_SOME_NUMBER

replace with

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NEWNUMBER

Is there a simple fix for this?

Best regards,

Jim

jimhavrilla commented 4 years ago

Well, I actually wrote a cluster script to pigz decompress | sed fix | bgzip recompress all the files, so I guess I'm good, but I can leave this open if you still think it's a worthwhile and easy-to-add feature.
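
For anyone wanting to do something similar, the "sed fix" step could be sketched as a small stdin-to-stdout filter like the one below. This is only an illustration, not the script actually used here: the function names are made up, and it assumes single-sample gVCFs whose `#CHROM` line is tab-separated with exactly ten columns.

```python
import sys


def rename_single_sample(header_line, new_id):
    """Replace the sample name (10th column) on a single-sample #CHROM line."""
    fields = header_line.rstrip("\n").split("\t")
    if len(fields) != 10:
        # Assumption: one sample per gVCF, so 9 fixed columns + 1 sample column.
        raise ValueError("expected a single-sample #CHROM header line")
    fields[9] = new_id
    return "\t".join(fields) + "\n"


def fix_stream(lines, new_id):
    """Yield VCF lines unchanged, except for the #CHROM header line."""
    for line in lines:
        if line.startswith("#CHROM\t"):
            yield rename_single_sample(line, new_id)
        else:
            yield line


# Only act as a filter when exactly one argument (the new sample ID) is given.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.writelines(fix_stream(sys.stdin, sys.argv[1]))
```

Saved under a hypothetical name such as `fix_header.py`, it would sit in the middle of the pipe, e.g. `pigz -dc NEWNUMBER_23176_0_0.gz | python fix_header.py NEWNUMBER | bgzip -c > fixed/NEWNUMBER_23176_0_0.gz`. The whole file still gets decompressed and recompressed this way.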

mlin commented 4 years ago

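That sounds fine, but I believe bcftools reheader (http://samtools.github.io/bcftools/bcftools.html#reheader) can do it without even decompressing the data (https://github.com/samtools/bcftools/blob/dccba248adb120f5abc6929ba8175d694cb273be/reheader.c#L417).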

You are right, though: we didn't anticipate this ID-mapping road bump, and other groups will probably encounter it too. Is there a specific format in which everybody is likely to have this mapping in hand? We can look at doing the mapping on ingestion if the feature would be used that way; otherwise reheader seems okay if it's going to be a bunch of unique situations.
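
A rough sketch of how the reheader route could be scripted for many files, assuming (as in the example above) that the wanted sample ID is the part of the filename before the first underscore and that each gVCF holds a single sample. The helper names and the output layout are hypothetical. It relies on `bcftools reheader -s FILE`, which accepts a file of new sample names, one per line, in the same order as in the VCF, and on `-o` for the output path.

```python
import shlex
import subprocess
import sys
from pathlib import Path


def sample_id_from_filename(gvcf_path):
    """'NEWNUMBER_23176_0_0.gz' -> 'NEWNUMBER' (assumed naming convention)."""
    return Path(gvcf_path).name.split("_", 1)[0]


def plan_reheader(gvcf_path, out_dir):
    """Return (new sample ID, samples-file path, bcftools argument list)."""
    gvcf = Path(gvcf_path)
    out_dir = Path(out_dir)
    new_id = sample_id_from_filename(gvcf)
    # One-line file handed to `bcftools reheader -s`.
    samples_file = out_dir / (gvcf.name + ".samples.txt")
    # out_dir should differ from the input directory so nothing is overwritten.
    out_file = out_dir / gvcf.name
    cmd = ["bcftools", "reheader", "-s", str(samples_file),
           "-o", str(out_file), str(gvcf)]
    return new_id, samples_file, cmd


def main(argv):
    out_dir = Path(argv[1])
    out_dir.mkdir(parents=True, exist_ok=True)
    for gvcf_path in argv[2:]:
        new_id, samples_file, cmd = plan_reheader(gvcf_path, out_dir)
        samples_file.write_text(new_id + "\n")
        print(shlex.join(cmd))           # show what is about to run
        subprocess.run(cmd, check=True)  # requires bcftools on PATH


# Usage (hypothetical script name): python reheader_all.py OUT_DIR FILE.gz [FILE.gz ...]
if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv)
```

With ~50,000 files this loop would more likely be split across cluster jobs, but each job reduces to the same one command per file.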

jimhavrilla commented 4 years ago

Wow, I was not aware of that bcftools function. It probably would have been much easier because of the block decompression.

Basically the sample IDs in the header are wrong and needed to be replaced with the filename IDs.

I agree with your assessment: bcftools is probably good enough if it can do that quickly, so the feature probably isn't worth your time.

Thanks for getting back to me.

Jim Havrilla
