brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Sample id duplicates #90

Closed by andrewpatto 2 years ago

andrewpatto commented 2 years ago

Hi, wondering if you have any particular thoughts about how to handle sample_id duplicates in the cohort.

We will be using a large cohort split across lots of folders - and in the case of sample re-runs etc - will have fingerprint files created with identical sample ids (but living in different buckets/folders/run contexts).

As an example, here is the output of a run where I literally duplicated a fingerprint file into 2 locations (the actual use case would be duplicate sample ids but different actual fingerprints).

HG00097 HG00099 0.011 1345 7614 0.012 6756 6882 13436 2766 5661 5638 2759 16863 33 155 -1.0
HG00097 HG00099 0.011 1345 7614 0.012 6756 6882 13436 2766 5661 5638 2759 16863 33 155 -1.0
HG00099 HG00099 1.000 0 17115 1.000 6882 6882 13764 6882 5638 5638 5638 17115 0 350 -1.0

Because the output sample ids come from the fingerprint file - there would be no way to distinguish which HG00099 corresponds to which input file.

Thoughts?

andrewpatto commented 2 years ago

(we can possibly use the prefix mechanism when creating the fingerprints - but were hoping that fingerprints could retain the identical sample id as in their corresponding source bam)

brentp commented 2 years ago

I think the prefix mechanism is the only way, otherwise, as you note, how would you differentiate?

andrewpatto commented 2 years ago

I'd really love to leave the 'permanent' fingerprints - where they sit next to the BAM files - with their correct simple sample ids..

I do have a phase though where I bring copies of them all into the cohort directory - where I am going to run relate on them - and so can 'patch' the sample id at runtime - with whatever logic I need to distinguish the files. I see the sample id text in the first few bytes of the file format - is there a safe (albeit hacky) way I can alter that after I copy the fingerprint files? Is it a fixed-length field? (I see that the fingerprints are all the same length)

brentp commented 2 years ago

That field is not fixed-length, it's the length of the sample-name from the SM tag in the bam read-group.

I still don't understand why you can't use --sample-prefix for somalier extract and again for somalier relate. If indeed you cannot, then yes, you could rewrite/patch the sample name in the somalier file (along with the length field that precedes it). You can see the structure of the file by reading the python code here: https://github.com/brentp/somalier/blob/master/scripts/ancestry-predict.py#L7
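
A minimal sketch of such a patch, for illustration only. It assumes the file starts with a one-byte format version, then a one-byte name length, then the name bytes, with everything after that left untouched. That layout is an assumption here and should be checked against the linked script before use. The helper names `rename_sample` and `read_sample` are hypothetical.

```python
def read_sample(data: bytes) -> str:
    """Return the sample name stored at the start of a fingerprint.

    ASSUMPTION: byte 0 is the format version, byte 1 is the length of the
    sample name, and the name follows immediately after.
    """
    name_len = data[1]
    return data[2:2 + name_len].decode("ascii")


def rename_sample(data: bytes, new_name: str) -> bytes:
    """Return a copy of the fingerprint bytes with the sample name replaced.

    The length byte is rewritten to match the new name, so the output may
    be longer or shorter than the input.
    """
    name_bytes = new_name.encode("ascii")
    if len(name_bytes) > 255:
        raise ValueError("sample name does not fit in a one-byte length field")
    version = data[0]
    old_len = data[1]
    rest = data[2 + old_len:]  # everything after the old name is kept as-is
    return bytes([version, len(name_bytes)]) + name_bytes + rest
```

To patch a copied file, read it with `open(path, "rb")`, pass the bytes through `rename_sample`, and write the result to the copy in the cohort directory, leaving the original next to the BAM untouched.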

andrewpatto commented 2 years ago

I'll try the sample prefix - it's just all a bit complicated (for our probably unique setup) because of where/when the fingerprints are produced - I'll need to make new dynamic lab-wide prefixes just for the purposes of separating out the potentially clashing sample fingerprint ids. Definitely doable - but was hoping to avoid adding another id prefix into the mix.

Thanks for the file structure pointer - will play with that as well..

brentp commented 2 years ago

ok. note that different groups of files can have different prefixes added (with somalier extract) and then you can remove multiple prefixes when you run somalier relate.
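
A rough sketch of that workflow, written as Python helpers that only build the command lines and do not run anything. The `--sample-prefix` flag comes from this thread; passing it once per prefix to `somalier relate` is an assumption, as are the helper names, and the paths and prefixes in the usage note are hypothetical placeholders. The remaining `extract` options (`-d`, `--sites`, `-f`) should be checked against `somalier extract --help`.

```python
from typing import List, Sequence


def extract_cmd(bam: str, prefix: str, out_dir: str, sites: str, fasta: str) -> List[str]:
    """Build a `somalier extract` call that stores `prefix` + sample id."""
    return [
        "somalier", "extract",
        "--sample-prefix", prefix,  # distinguishes this group of files
        "-d", out_dir,
        "--sites", sites,
        "-f", fasta,
        bam,
    ]


def relate_cmd(prefixes: Sequence[str], fingerprints: Sequence[str]) -> List[str]:
    """Build a `somalier relate` call that strips every given prefix.

    ASSUMPTION: the flag is repeated once per prefix to be removed.
    """
    cmd = ["somalier", "relate"]
    for prefix in prefixes:
        cmd += ["--sample-prefix", prefix]
    return cmd + list(fingerprints)
```

For example, `relate_cmd(["runA_", "runB_"], ["a/HG00099.somalier", "b/HG00099.somalier"])` gives one argument list that could be handed to `subprocess.run`.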