brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Is there a way to rename samplename within .somalier files? #138

Open SpikyClip opened 4 months ago

SpikyClip commented 4 months ago

Hi,

First off, this has been an amazingly useful tool for my work, really appreciate it!

So I didn't realise at the time that sample names are already fixed by the time you reach the somalier relate stage (i.e. renaming the .somalier files does not affect the output of relate; if I'm not wrong, it actually gets the name from within the VCF/BAM?). This causes issues if a sample was run multiple times across batches and you try to relate those runs across batches.

Is there any way of renaming the sample name within the .somalier binary files? If so, I could write a script that recursively loops through my batch folders and appends the batch_id and date_processed to each name, since I have run hundreds of these on a per-batch basis. I understand that the appropriate way would have been to set output-prefix at the somalier extract stage, but I'd rather not have to recall all these BAMs/VCFs to rerun somalier if possible.

It would be great if somalier relate had some sort of --samplename-from-filename flag that would rely on the filename for the samplename (though admittedly, it feels a little hacky). Or a simple --samplesheet samplenames.csv that maps two columns, sample,somalier_path for renaming.
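To make the second suggestion concrete, here is a minimal sketch of how such a sample sheet might be read. It is purely illustrative: `--samplesheet` is a proposed flag and not an existing somalier option, and the function name `read_samplesheet` and the example rows are hypothetical.

```python
import csv
import io


def read_samplesheet(handle):
    """Map each somalier_path to its desired sample name.

    Expects a CSV with a header row containing the two columns proposed
    above: sample,somalier_path.
    """
    mapping = {}
    for row in csv.DictReader(handle):
        path = row["somalier_path"].strip()
        if path in mapping:
            raise ValueError(f"path listed more than once: {path}")
        mapping[path] = row["sample"].strip()
    return mapping


# Hypothetical sheet: the same sample processed in two batches.
sheet = io.StringIO(
    "sample,somalier_path\n"
    "sampleA_batch01,batch01/sampleA.somalier\n"
    "sampleA_batch02,batch02/sampleA.somalier\n"
)
for path, name in read_samplesheet(sheet).items():
    print(path, "->", name)
```

Keying the mapping on the path rather than the sample name is deliberate here, since the whole problem is that the same name appears in more than one batch.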

brentp commented 4 months ago

Hi, glad to hear it's useful. You could use this Python script: https://github.com/brentp/somalier/blob/master/scripts/ancestry-predict.py to see the format of the .somalier files, specifically the read_somalier function. You could then write the file back out with a new name and name length, with all else mostly the same. int.to_bytes and arr.tobytes() reverse the operations you see there.
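A rough sketch of that approach follows. It assumes the file starts with a one-byte format version, then a one-byte name length, then the name itself, with everything after that copied through untouched. That layout is an assumption and should be checked against read_somalier before running this on real data. The function names `rename_sample`, `current_name`, and `rename_file` are made up for illustration.

```python
from pathlib import Path


def rename_sample(data: bytes, new_name: str) -> bytes:
    """Return a copy of .somalier bytes with the embedded sample name replaced.

    ASSUMED layout (verify against read_somalier): byte 0 is the format
    version, byte 1 is the name length, and the name follows immediately.
    """
    if len(data) < 2:
        raise ValueError("too short to contain a version and name length")
    name_len = data[1]
    if len(data) < 2 + name_len:
        raise ValueError("stored name length exceeds the file size")
    encoded = new_name.encode()
    if not 0 < len(encoded) <= 255:
        raise ValueError("new name must be 1-255 bytes to fit a one-byte length")
    version = data[:1]               # keep the version byte as-is
    payload = data[2 + name_len:]    # everything after the old name, unchanged
    new_len = len(encoded).to_bytes(1, byteorder="little")
    return version + new_len + encoded + payload


def current_name(data: bytes) -> str:
    """Read the embedded sample name under the same assumed layout."""
    return data[2:2 + data[1]].decode()


def rename_file(src: Path, dst: Path, new_name: str) -> None:
    """Write a renamed copy of src to dst, leaving src untouched."""
    dst.write_bytes(rename_sample(src.read_bytes(), new_name))
```

Writing to a new path rather than overwriting in place keeps the original files available, and reading the output back with read_somalier is a cheap way to confirm the result still parses.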

SpikyClip commented 4 months ago

Thanks for the response, I'll give it a shot when I have the time and let you know how I go.