brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

SIGSEGV: Illegal storage access. (Attempt to read from nil?) during Somalier relate #109

Open jlac opened 1 year ago

jlac commented 1 year ago

I am running somalier relate to check relationships in a PED file against those predicted from genotype data. When I run it without providing the PED file, the run succeeds, but the *samples.tsv output has doubled sample IDs for many of the samples (see attached cumulative_check.samples.tsv). When I do provide the PED file, it dies at the end with a segmentation fault; the error looks like this:

[somalier] joining families FAM0001196 and FAM0002040 because of relatedness > 0.2
[somalier] joining families FAM0004676 and FAM0004487 because of relatedness > 0.2
[somalier] joining families FAM0003054 and FAM0003526 because of relatedness > 0.2
[somalier] removing assigned father from P0011761 and setting to unknown
[somalier] removing assigned father from P0006232 and setting to unknown
[somalier] removing assigned mother from P0006458 and setting to unknown
[somalier] removing assigned father from P0004490 and setting to unknown
[somalier] removing assigned mother from P0008745 and setting to unknown
[somalier] removing assigned mother from P0008591 and setting to unknown
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Segmentation fault

I assume there is some sort of issue with the PED file I am providing, but I cannot figure out what it is.

Attachments: cumulative.ped.zip, cumulative_check.samples.tsv.zip

Thank you!

Justin

brentp commented 1 year ago

how many samples? can you show the command that you are running?

brentp commented 1 year ago

I don't see any duplicate samples in your tsv:

cut -f 2 cumulative_check.samples.tsv| sort | uniq -d

that gives no result (so no duplicated samples). It could be that you are running out of memory with more than 5000 samples. But I would try running without --infer, if that's what you did.

jlac commented 1 year ago

Sorry, I did a poor job explaining myself. It's 5,371 samples and the command is:

somalier relate -p cumulative.ped --infer -o pedcheck/test /data/GRIS_NCBR/hgsc_processed/csi_wgs_processing/cumulative_somalier/all_sites/*.somalier

The duplicates are not duplicated lines; the sample ID itself is printed twice within a single field, like this: P0012446P0012446. For this sample, the ID is actually P0012446.
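A quick way to list the affected samples is to look for IDs whose first half equals their second half. The following is only a sketch: the function name `find_doubled_ids` is made up for illustration, and it assumes the samples.tsv layout described in this thread (tab-separated, `#`-prefixed header, sample ID in the second column).

```shell
# Hypothetical helper: print sample IDs (column 2) that consist of one
# string repeated twice, e.g. P0012446P0012446.
# Assumes a tab-separated file whose header line starts with '#'.
find_doubled_ids() {
    awk -F'\t' '
        /^#/ { next }                      # skip the header line
        {
            n = length($2)
            if (n > 0 && n % 2 == 0) {
                half = substr($2, 1, n / 2)
                # compare the first half with the second half
                if (half == substr($2, n / 2 + 1)) print $2
            }
        }
    ' "$1"
}

# Usage (file name taken from the thread):
#   find_doubled_ids cumulative_check.samples.tsv
```

Unlike `uniq -d`, which only finds repeated lines, this inspects the content of each ID.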

jlac commented 1 year ago

And I have an interactive node with 72 GB of memory, and the max memory during the run is ~1.5 GB

brentp commented 1 year ago

Hi Justin, it defaults to using the sample ID as the family name. The first column is the family name and the second column is the sample name, so the ID appears twice. I would run it without --infer first.

brentp commented 1 year ago

The --infer stuff is less battle-tested. I think the best way is to run without infer and evaluate the HTML. If you share it here or to me privately, I can help interpret. If it looks like --infer is needed, then I can send a debug version of somalier that will help us track down the error.

jlac commented 1 year ago

Removing the --infer flag did result in a successful run. I am attaching the output files.

Also, the sample ID duplication still seems strange. For example, this line in the samples.tsv output:

P0003626P0003626 P0000800P0000800 -9 -9 -9 -9 -9 37.8 6.9 37.7 7.0 0.53 0.39 4817 6603 5753 211 0.005 18.71 350 172 0 178 19.29 14

You can see that the original sample ID is written twice in both the FamilyID column and the sampleID column, in spite of the fact that in the PED file this sample has a FamilyID:

FAM0000330 P0003626 -9 -9 1 -9

Justin

Attachments: test.samples.tsv.zip, test.pairs.tsv.zip, test.html.zip, test.groups.tsv.zip

brentp commented 1 year ago

I would check if you have tabs or single spaces in your pedigree file. Maybe there is some formatting issue.
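One way to check this is to count how many pedigree lines use tabs, spaces, or a mix, and to flag lines that do not split into six fields. This is a rough sketch only: `ped_delimiter_summary` is a hypothetical name, and the six-column expectation is an assumption based on the PED line quoted earlier in this thread.

```shell
# Hypothetical helper: summarize which delimiters a PED file uses.
# Assumes six whitespace-separated columns per record, as in the
# example line "FAM0000330 P0003626 -9 -9 1 -9" from this thread.
ped_delimiter_summary() {
    awk '
        /^#/ || /^[[:space:]]*$/ { next }          # skip comments and blank lines
        {
            has_tab   = (index($0, "\t") > 0)
            has_space = (index($0, " ") > 0)
            if (has_tab && has_space) mixed++
            else if (has_tab)         tab++
            else if (has_space)       space++
            # flag records that do not split into six fields
            if (NF != 6) printf "line %d: %d fields (expected 6)\n", NR, NF
        }
        END { printf "tab=%d space=%d mixed=%d\n", tab, space, mixed }
    ' "$1"
}

# Usage (file name taken from the thread):
#   ped_delimiter_summary cumulative.ped
```

A file that reports anything other than a single delimiter style is worth normalizing before re-running relate.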

Otherwise, your output HTML looks good to me. You can turn off unrelated points (blue) by clicking on them at the top of the plot. Then you can hover over points and see pairs of supposedly related samples that do not appear to be so based on the genotypes. There is a cluster of 4 pink points (parent-child) that have very low IBS0 and IBS2.

If you send your pedigree file, I can also look for issues there.

jlac commented 1 year ago

Thanks, I will try those things! The PED file is attached. cumulative.ped.zip

brentp commented 1 year ago

Oh, did you use --sample-prefix when you ran extract? That would be where the duplicated names came from. You only need that if, for example, you have sequenced P0003626 two times and both BAM files have that as the read-group ID. Then you want to use a sample prefix to get, e.g., A-P0003626 and B-P0003626 to differentiate them.

That is my best guess: if you go back and re-extract those few samples without using --sample-prefix, it will be resolved.
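To find which samples would need re-extracting, one option is to list sample IDs in the relate output that have no matching entry in the pedigree. The sketch below is illustrative only: `ids_missing_from_ped` is a hypothetical name, and it assumes both files keep the sample ID in the second whitespace-separated column with `#`-prefixed headers.

```shell
# Hypothetical helper: print sample IDs from a samples.tsv (arg 2) that
# do not occur as a sample ID in the PED file (arg 1).
# Assumes column 2 holds the sample ID in both files and the PED file
# contains at least one record.
ids_missing_from_ped() {
    awk '
        /^#/ { next }                          # skip header/comment lines
        NR == FNR { in_ped[$2] = 1; next }     # first file: remember PED IDs
        !($2 in in_ped) { print $2 }           # second file: report unknown IDs
    ' "$1" "$2"
}

# Usage (file names taken from the thread):
#   ids_missing_from_ped cumulative.ped cumulative_check.samples.tsv
```

Any ID printed here, such as a doubled one, points at a .somalier file whose stored sample name differs from the pedigree.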

I looked at your pedigree file and don't see any problems.