Closed andrewpatto closed 2 years ago
(we can possibly use the prefix mechanism when creating the fingerprints - but we were hoping that fingerprints could retain the identical sample id as in their corresponding source bam)
I think the prefix mechanism is the only way, otherwise, as you note, how would you differentiate?
I'd really love to leave the 'permanent' fingerprints - where they sit next to the BAM files - with their correct simple sample ids..
I do have a phase, though, where I bring copies of all the fingerprints into the cohort directory where I am going to run relate on them - so I can 'patch' the sample id at runtime with whatever logic I need to distinguish the files. I see the sample id text in the first few bytes of the file format - is there a safe (albeit hacky) way I can alter that after I copy the fingerprint files? Is it a fixed-length field? (I see that the fingerprints are all the same length)
That field is not fixed-length, it's the length of the sample-name from the SM tag in the bam read-group.
I still don't understand why you can't use --sample-prefix for somalier extract and again for somalier relate; if indeed you cannot, then yes, you could rewrite/patch the sample name in the somalier file (along with the length field that precedes it). You can see the structure of the file by reading the python code here: https://github.com/brentp/somalier/blob/master/scripts/ancestry-predict.py#L7
I'll try the sample prefix - it's just all a bit complicated (for our probably unique setup) because of where/when the fingerprints are produced - I'll need to make new dynamic lab-wide prefixes just for the purposes of separating out the potentially clashing sample fingerprint ids. Definitely doable - but I was hoping to avoid adding another id prefix into the mix.
Thanks for the file structure pointer - will play with that as well..
ok. note that different groups of files can have different prefixes added (with somalier extract) and then you can remove multiple prefixes when you run somalier relate.
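That workflow might look something like the sketch below. The paths, prefix names, and sites/reference files are all illustrative; the `--sample-prefix` flags on both subcommands are taken from the discussion above - confirm the exact flag names against `somalier extract --help` and `somalier relate --help` for your version.

```shell
# Tag each batch of fingerprints with a distinguishing prefix at extract time
# (runA_/runB_ and all paths here are made-up placeholders).
somalier extract --sample-prefix runA_ -d fingerprints/runA \
    --sites sites.vcf.gz -f ref.fa sampleA.bam
somalier extract --sample-prefix runB_ -d fingerprints/runB \
    --sites sites.vcf.gz -f ref.fa sampleB.bam

# relate can strip multiple prefixes when matching up identical samples,
# so clashing ids from different runs stay distinguishable in the output.
somalier relate --sample-prefix runA_ --sample-prefix runB_ \
    fingerprints/runA/*.somalier fingerprints/runB/*.somalier
```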
Hi, wondering if you have any particular thoughts about how to handle sample_id duplicates in the cohort.
We will be using a large cohort split across lots of folders - and in the case of sample re-runs etc. we will have fingerprint files created with identical sample ids (but living in different buckets/folders/run contexts).
So, as an example, I ran it in a case where I literally duplicated a fingerprint file into 2 locations (but imagining the actual use case to be duplicate sample ids with different actual fingerprints).
Because the output sample ids come from the fingerprint file - there would be no way to distinguish which HG00099 corresponds to which input file.
Thoughts?