brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

[error] SIGSEGV: Illegal storage access. (Attempt to read from nil?) Segmentation fault (core dumped) #131

Open egenomics opened 6 months ago

egenomics commented 6 months ago

Hi, I am getting an error in the last step of ancestry calling, after successfully generating all the .somalier query files and downloading the relevant ancestry-labels-1kg.tsv file and 1kg.somalier/.somalier files.

Here is the command, which we have tested on two different machines with the same error:

(base) jlvillanueva@EEP10709:~/Downloads/somalier_aina$ ll
total 117916
drwxrwxr-x  4 jlvillanueva jlvillanueva     4096 Dec 18 15:45 ./
drwxr-xr-x 54 jlvillanueva jlvillanueva    40960 Dec 18 16:06 ../
drwxrwxr-x  3 jlvillanueva jlvillanueva     4096 Dec 18 15:44 1kg.somalier/
-rw-rw-r--  1 jlvillanueva jlvillanueva 82856769 Dec 18 15:44 1kg.somalier.tar.gz
-rw-rw-r--  1 jlvillanueva jlvillanueva    56028 Dec 18 15:44 ancestry-labels-1kg.tsv
drwxrwxr-x  2 jlvillanueva jlvillanueva     4096 Dec 18 15:09 cohort/
-rw-rw-r--  1 jlvillanueva jlvillanueva   265818 Dec 18 15:44 sites.hg38.vcf.gz
-rwxrwxr-x  1 jlvillanueva jlvillanueva 37500280 Dec 18 15:44 somalier*
(base) jlvillanueva@EEP10709:~/Downloads/somalier_aina$ ./somalier ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.18
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Segmentation fault (core dumped)

brentp commented 6 months ago

Hi, can you run the same command with the binary attached here (after gunzip somalier_dbg.gz && chmod +x somalier_dbg) and show the output? somalier_dbg.gz

egenomics commented 6 months ago

Hi, Thanks for the quick response! I get the following error:

(base) jlvillanueva@EEP10709:~/Downloads/somalier_aina$ ./somalier_dbg ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.19
/home/brentp/src/somalier/src/somalier.nim(276) somalier
/home/brentp/src/somalier/src/somalier.nim(263) main
/home/brentp/src/somalier/src/somalierpkg/ancestry.nim(137) ancestry_main
/nim-1.6.6/lib/system/fatal.nim(53) sysFatal
Error: unhandled exception: index out of bounds, the container is empty [IndexDefect]

brentp commented 6 months ago

It seems that the training matrix (1kg) is empty so either the sites don't match or you don't have samples in that directory. What does:

ls -lh 1kg.somalier/*.somalier | head

show?
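
The same check can be scripted before launching the ancestry step. This is a minimal sketch, not part of somalier: `expand_or_fail` is a hypothetical helper that expands a glob pattern in Python and stops early if nothing matches, since bash by default passes an unmatched pattern through literally.

```python
import glob


def expand_or_fail(pattern):
    """Return the sorted paths matching `pattern`, or raise if none match.

    Hypothetical helper for illustration; it is not part of somalier.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return paths
```

For example, `expand_or_fail("1kg.somalier/*.somalier")` would raise immediately if the .somalier files actually live one directory deeper.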

egenomics commented 6 months ago

I feel a bit dumb... There is another folder inside 1kg.somalier. I have fixed the command. However it still gives an error:

./somalier_dbg ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/1kg-somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.19
Segmentation fault (core dumped)

brentp commented 6 months ago

Hmm. It's a problem that we're not getting any information beyond the segfault now.

egenomics commented 6 months ago

We have tested it on two different computers with the same error :(

brentp commented 6 months ago

Yes, I expect that it will be the same on any machine. How many samples are you looking at? I attach here another binary with hopefully more debug info turned on. Maybe it will give us more clues. somalier_dbg2.gz

The ancestry stuff is, as you're finding, less used and more prone to problems than the rest of somalier. You might also try python scripts/ancestry-predict.py which uses PCA -> SVM instead of a neural network. You can run that with -h to see the arguments.

egenomics commented 6 months ago

I am looking at 24 samples:

ls cohort/*.somalier | wc -l
24

I have tried the second debug binary, but I get no more information than with the previous one:

./somalier_dbg2 ancestry --labels ancestry-labels-1kg.tsv 1kg.somalier/1kg-somalier/*.somalier ++ cohort/*.somalier
somalier version: 0.2.19
Segmentation fault (core dumped)

As for the Python script, I get a strange error:

python code/somalier/scripts/ancestry-predict.py --labels ancestry-labels-1kg.tsv --backgrounds 1kg.somalier/1kg-somalier/*.somalier --samples cohort/*.somalier --plot test_plot
Traceback (most recent call last):
  File "/home/jlvillanueva/Downloads/somalier_aina/code/somalier/scripts/ancestry-predict.py", line 171, in <module>
    df_pca = df_pca.append(
  File "/home/jlvillanueva/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'

Thanks again for your assistance Brent!

egenomics commented 6 months ago

The plot is generated though: test_plot

brentp commented 6 months ago

Looks like append is gone from pandas. You can change lines 171-172 from:

            df_pca = df_pca.append(
                other=(pd.DataFrame(test_reduced, test_samples, labels_pc)))

to:

            df_pca = pd.concat([df_pca, pd.DataFrame(test_reduced, test_samples, labels_pc)])

I think that should work, but haven't tested it.
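
A minimal sketch of that replacement on toy data, assuming pandas 2.0 or later, where `DataFrame.append` no longer exists. The sample names, column labels and values are made up for illustration; only the `pd.concat` line mirrors the change suggested above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the script's variables (values are illustrative only).
labels_pc = ["PC1", "PC2", "PC3"]
df_pca = pd.DataFrame(np.zeros((2, 3)), ["bg_a", "bg_b"], labels_pc)
test_reduced = np.ones((2, 3))
test_samples = ["query_a", "query_b"]

# pd.concat stacks the new rows under the existing ones,
# which is what DataFrame.append used to do.
df_pca = pd.concat([df_pca, pd.DataFrame(test_reduced, test_samples, labels_pc)])
print(df_pca)
```

Both frames share the same columns, so the result keeps the three PC columns and carries the sample names as its index.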

brentp commented 6 months ago

You can also change other things in the script. For example, on line 92 you can change n_components to 3. You can also see the other SVM parameters you could change: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

If you make all of these changes and get something that looks good, I'd be happy to get a PR that incorporates the changes.
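
To illustrate what the n_components and SVC parameters control, here is a rough sketch of a PCA -> SVM classifier in scikit-learn. It is not the code from scripts/ancestry-predict.py: the data is synthetic, the two labels "A" and "B" are placeholders, and the matrix sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a samples x sites matrix: two labelled groups
# of 40 samples each, 50 sites per sample (all sizes are assumptions).
X_train = np.vstack([
    rng.normal(0.0, 1.0, (40, 50)),
    rng.normal(3.0, 1.0, (40, 50)),
])
y_train = np.array(["A"] * 40 + ["B"] * 40)
X_query = rng.normal(3.0, 1.0, (5, 50))

# Project onto 3 principal components, then classify with an SVM.
# probability=True enables predict_proba at some extra training cost.
model = make_pipeline(PCA(n_components=3), SVC(probability=True))
model.fit(X_train, y_train)

pred = model.predict(X_query)         # one label per query sample
proba = model.predict_proba(X_query)  # one row of class probabilities per sample
print(pred)
print(proba.round(3))
```

Changing `n_components`, or passing SVC options such as `C`, `kernel` or `gamma`, is the kind of tuning described in the linked scikit-learn page.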

egenomics commented 6 months ago

With these changes it looks like it works. I tried setting n_components to 3 and, for the test run, it visually looks better at assigning populations. I will run it on many more samples to see what we get.

Do you know if there is a background dataset with more population granularity? It would be quite interesting to know the population of origin for certain patients; continental ancestry is a hint but still very general. We usually have exomes and gene panels, so most intergenic SNPs are not captured.

brentp commented 6 months ago

Thousand genomes has finer subpopulations, but then you have so few training samples that it's not as reliable. There may be other resources for this, but I haven't kept up with them.

brentp commented 6 months ago

Glad to hear it's working.