freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

Sample_ID from samples file not saved to VCF file #20

Closed kihear closed 4 years ago

kihear commented 4 years ago

First of all, thank you very much for this excellent pipeline!

I have been able to convert IDAT files to GTC successfully, and during the conversion iaap-cli correctly recognises the sample ID from the samples file. However, when converting from GTC to VCF, the ID is set back to "SentrixBarcode_A_SentrixPosition_A".

The samples CSV file is structured as follows:

[Data]
Sample_ID,SentrixBarcode_A,SentrixPosition_A,Path

During the iaap-cli conversion I get the message:

info: ArrayAnalysis.NormToGenCall.Services.NormToGenCallSvc[0] [07:09:03 1893]: Writing [Sample_ID_Obfuscated] to gtc...

When I query the IDs from the converted VCF file with bcftools query -l, I get:

[SentrixBarcode_A]_[SentrixPosition_A]
[SentrixBarcode_A]_[SentrixPosition_A]
[SentrixBarcode_A]_[SentrixPosition_A]
...

I know I can annotate the VCF IDs again, but I would rather build a pipeline where this is not necessary.
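
For readers who do want to re-annotate the IDs afterwards, here is a minimal sketch of one way to do it. It assumes the samples file has exactly the columns shown above, and that the VCF sample names are the barcode and position joined by an underscore. The function name, the example sheet values and the file names in the usage note are hypothetical.

```python
import csv
import io

# Hypothetical sample sheet in the layout described above.
EXAMPLE_SHEET = """[Data]
Sample_ID,SentrixBarcode_A,SentrixPosition_A,Path
sampleA,200000000001,R01C01,/data/idat
sampleB,200000000001,R02C01,/data/idat
"""


def reheader_map(sample_sheet_text):
    """Return 'old_name new_name' lines, one per sample in the sheet."""
    lines = sample_sheet_text.splitlines()
    # Skip everything up to and including the [Data] marker, if there is one.
    start = next(
        (i for i, line in enumerate(lines) if line.strip().startswith("[Data]")),
        -1,
    ) + 1
    reader = csv.DictReader(io.StringIO("\n".join(lines[start:])))
    pairs = []
    for row in reader:
        # Assumption: the VCF currently carries <barcode>_<position> as the name.
        old = f"{row['SentrixBarcode_A']}_{row['SentrixPosition_A']}"
        new = row["Sample_ID"]
        if not new or any(ch.isspace() for ch in new):
            raise ValueError(f"Sample_ID {new!r} is empty or contains whitespace")
        pairs.append(f"{old} {new}")
    return pairs


if __name__ == "__main__":
    print("\n".join(reheader_map(EXAMPLE_SHEET)))
```

Saving the printed lines to a file (say map.txt) gives a list of "old_name new_name" pairs, which is one of the formats accepted by bcftools reheader -s, for example: bcftools reheader -s map.txt -o renamed.vcf.gz input.vcf.gz.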

freeseek commented 4 years ago

Hi Kimmo. I have noticed that some IDAT files contain sample name information (together with sample plate and sample well) but others do not. If this information is present, it is carried over into the GTC file. I am not sure what causes these discrepancies. Because many IDAT files do not contain sample name information, and because [SentrixBarcode]_[SentrixPosition] is more likely to be a unique identifier, which the VCF specification requires, the default mode is to use the GTC file name, which should correspond to [SentrixBarcode]_[SentrixPosition]. I have now added the option --use-gtc-sample-names, which will use the GTC sample name instead, if present. Are you able to rebuild the plugin from source to test whether it works for you?
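
The naming rule described in this comment can be illustrated with a small sketch. This is not the plugin's actual code. The function and parameter names are hypothetical, and stripping the .gtc extension from the file name is an assumption.

```python
from pathlib import Path


def vcf_sample_id(gtc_path, gtc_sample_name=None, use_gtc_sample_names=False):
    """Pick a VCF sample ID following the rule described in the comment above."""
    # With the option enabled, prefer the name stored in the GTC file, if any.
    if use_gtc_sample_names and gtc_sample_name:
        return gtc_sample_name
    # Default (and fallback): the GTC file name without directory or extension.
    return Path(gtc_path).stem
```

The fallback in the second branch reflects the "if present" wording: a GTC file without an embedded sample name would still get its file-name-based ID.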

kihear commented 4 years ago

Thank you for this update! I didn't look very deeply into the source code, which is why I missed the fact that the sample ID is generated from the file name. Renaming the files automatically after converting to GTC would therefore also be a feasible workaround. I'll try recompiling and using the new option too.
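
A minimal sketch of the renaming workaround, assuming the GTC files are named <barcode>_<position>.gtc and that a mapping from those names to Sample_ID values has already been read from the samples file. The function name, the id_map argument and the dry_run switch are hypothetical. The uniqueness check reflects the point made above that VCF sample names must be unique.

```python
from pathlib import Path


def rename_gtcs(gtc_dir, id_map, dry_run=True):
    """Rename <barcode>_<position>.gtc to <Sample_ID>.gtc.

    id_map maps the current file stem to the desired Sample_ID.
    Returns the list of (source, destination) paths that were planned.
    """
    if len(set(id_map.values())) != len(id_map):
        raise ValueError("Sample_ID values must be unique")
    planned = []
    for old_stem, new_stem in sorted(id_map.items()):
        src = Path(gtc_dir) / f"{old_stem}.gtc"
        dst = Path(gtc_dir) / f"{new_stem}.gtc"
        if not src.exists():
            continue  # no GTC file was produced for this sample
        if dst.exists():
            raise FileExistsError(f"{dst} already exists")
        planned.append((src, dst))
    # Only touch the file system once every rename has been validated.
    if not dry_run:
        for src, dst in planned:
            src.rename(dst)
    return planned
```

Calling it first with dry_run=True shows which files would be renamed before anything on disk changes.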

Thank you for your support!