biocore / DEICODE

Robust Aitchison PCA from sparse count data

segmentation fault #65

Open nvpatin opened 2 years ago

nvpatin commented 2 years ago

I am receiving a "segmentation fault" error when I try to run DEICODE auto-rpca. I've tried running it both in QIIME2 and standalone, with the standalone as the most recent version installed with conda ("conda install -c conda-forge deicode") both locally and on an HPC system, with the same outcome. My full command is

deicode auto-rpca --in-biom metaflye_hybrid_dfs.biom --output-dir metaflye-hybrid-deicode

And the error message is:

/var/spool/slurmd/job116796/slurm_script: line 22: 82401 Segmentation fault   

It's not very informative. On the HPC I tried increasing the memory allocation to 500GB (10 nodes) with no success. I attached the tab-separated text version of my BIOM file here in case that is helpful.

Any suggestions are greatly appreciated.

metaflye_hybrid_dfs.txt https://github.com/biocore/DEICODE/files/9688049/metaflye_hybrid_dfs.txt

mortonjt commented 2 years ago

hmm weird -- I don't think it is a memory issue. More likely there is a software dependency issue. Could you provide the qiime2 version and the conda environment? You can display the output of conda env export

nvpatin commented 2 years ago

Thanks for the quick response! Here is the output of 'conda env export': name: qiime2-2022.2 channels:

nvpatin commented 2 years ago

I resolved this problem by changing my parameters for --p-min-feature-count and --p-min-sample-count in QIIME2 (I changed both to 1). I'm not sure how it could be resolved in the standalone DEICODE. This issue can be considered resolved now.

nvpatin commented 2 years ago

Hi again, I am getting another segmentation fault after I increased the size of my data set. Once again it occurs with both the standalone and QIIME2 plug-in for DEICODE. I tried changing several parameters; here are my most recent:

In QIIME2:

qiime deicode rpca --i-table metaflye_hybrid_illumina_dfs.qza --p-min-feature-count 1 --p-min-sample-count 500 --o-biplot metaflye_hybrid_illumina_deicode_ordination.qza --o-distance-matrix metaflye_hybrid_illumina_deicode_distance.qza --p-max-iterations 1 --p-n-components 2

Standalone DEICODE:

deicode auto-rpca --in-biom metaflye_hybrid_illumina_dfs.biom --output-dir Lasker2019_ORFs_metaflye_illumina_hybrid-deicode

I attached the tab-separated text file of the table that I converted to BIOM and QIIME2 .qza formats. It is not a standard amplicon data set; rather, it reflects the number of taxonomically annotated ORFs for a set of shotgun metagenomes. I did successfully use a similar but smaller file earlier. metaflye_hybrid_illumina_dfs.txt https://github.com/biocore/DEICODE/files/9700329/metaflye_hybrid_illumina_dfs.txt

mortonjt commented 2 years ago

Hi, it may be worthwhile to compute feature_count (i.e. the number of samples a given feature is observed in) as well as sample_count (the number of OTUs observed in a given sample):

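import pandas as pd
import qiime2

# Load the feature table as a DataFrame (rows are samples, columns are features).
df = qiime2.Artifact.load('metaflye_hybrid_illumina_dfs.qza').view(pd.DataFrame)
# Number of samples in which each feature is observed.
feature_count = (df > 0).sum(axis=0)
# Number of features observed in each sample.
sample_count = (df > 0).sum(axis=1)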

If you have any microbes that aren't observed in any samples, or samples with no microbes, that will lead to a segfault.
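
A minimal sketch of that check, assuming a pandas DataFrame with samples as rows and features as columns. The toy table and the names `empty_features`, `empty_samples` and `filtered` are made up for illustration; with real data, `df` would come from the artifact loaded above.

```python
import pandas as pd

# Hypothetical toy table: rows are samples, columns are features.
df = pd.DataFrame(
    {"f1": [5, 0, 2], "f2": [0, 0, 0], "f3": [1, 0, 7]},
    index=["s1", "s2", "s3"],
)

feature_count = (df > 0).sum(axis=0)  # samples in which each feature is observed
sample_count = (df > 0).sum(axis=1)   # features observed in each sample

# Features never observed and samples with nothing observed.
empty_features = feature_count[feature_count == 0].index.tolist()
empty_samples = sample_count[sample_count == 0].index.tolist()
print("features never observed:", empty_features)
print("samples with no features:", empty_samples)

# Drop the empty rows and columns before running RPCA.
filtered = df.loc[sample_count > 0, feature_count > 0]
print(filtered.shape)
```

If both lists come back empty, as they apparently would for a table whose minimum counts are all above zero, empty rows or columns are probably not the cause.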

nvpatin commented 2 years ago

I did as you suggested. My min(feature_count) is 1 and my min(sample_count) is 121.

nvpatin commented 2 years ago

Are there ways to modify the default parameters in the standalone DEICODE command? I know how to do it in QIIME2 but I would like to try similar modifications in the standalone tool.

cameronmartino commented 2 years ago

Yes and thanks for posting the issue and the data. I was able to replicate the issue. I will need to dig into the error because it is not immediately apparent why it is happening. I have run much bigger and more square datasets without an issue, so I am not convinced it is entirely memory related but it does seem related to the feature space size.

A temporary fix is to reduce the feature space with a frequency filter. On my laptop I had to use a pretty extreme filter, removing any feature present in fewer than 50% of the samples. On a compute cluster you may be able to get away with a less aggressive threshold (e.g. 10).

The following commands both worked on my laptop:

deicode rpca --in-biom metaflye_hybrid_illumina_dfs.biom --output-dir metaflye_hybrid_illumina_dfs_test --min-feature-frequency 50
qiime deicode rpca --i-table metaflye_hybrid_illumina_dfs.qza --output-dir metaflye_hybrid_illumina_dfs_test --p-min-feature-frequency 50

and here are all the parameters in the standalone command:

Usage: deicode rpca [OPTIONS]

  Runs RPCA with an rclr preprocessing step.

Options:
  --in-biom TEXT                  Input table in biom format.  [required]
  --output-dir TEXT               Location of output files.  [required]
  --n_components INTEGER          The underlying low-rank structure. The input
                                  can be an integer (suggested: 1 < rank < 10)
                                  [minimum 2]. Note: as the rank increases the
                                  runtime will increase dramatically.
                                  [default: 3]
  --min-sample-count INTEGER      Minimum sum cutoff of sample across all
                                  features. The value can be at minimum zero
                                  and must be an whole integer. It is
                                  suggested to be greater than or equal to
                                  500.  [default: 500]
  --min-feature-count INTEGER     Minimum sum cutoff of features across all
                                  samples. The value can be at minimum zero
                                  and must be an whole integer  [default: 10]
  --min-feature-frequency INTEGER
                                  Minimum percentage of samples a feature must
                                  appear with a value greater than zero. This
                                  value can range from 0 to 100 with decimal
                                  values allowed.  [default: 0]
  --max_iterations INTEGER        The number of iterations to optimize the
                                  solution (suggested to be below 100; beware
                                  of overfitting) [minimum 1]  [default: 5]
  --help                          Show this message and exit.
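
For intuition, the frequency filter can be approximated in pandas as below. This is only a sketch of the behaviour described in the help text for --min-feature-frequency (keep features that are nonzero in at least a given percentage of samples), not DEICODE's own code. The function name `filter_by_frequency` and the toy table are hypothetical, and treating the threshold as inclusive is an assumption.

```python
import pandas as pd


def filter_by_frequency(df: pd.DataFrame, min_feature_frequency: float) -> pd.DataFrame:
    """Keep features that are nonzero in at least the given percentage of samples.

    Assumes rows are samples and columns are features.
    """
    # Percentage of samples in which each feature has a nonzero value.
    percent_present = (df > 0).mean(axis=0) * 100
    return df.loc[:, percent_present >= min_feature_frequency]


# Hypothetical toy table: "common" is in 100% of samples, "half" in 50%, "rare" in 25%.
table = pd.DataFrame(
    {"common": [3, 1, 4, 2], "half": [0, 5, 0, 6], "rare": [0, 0, 0, 9]},
    index=["s1", "s2", "s3", "s4"],
)
kept = filter_by_frequency(table, 50)
print(list(kept.columns))
```

Running the same function over a range of thresholds on the real table gives a quick view of how many features each setting would discard before committing to a full RPCA run.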

nvpatin commented 2 years ago

Thanks for this information. I was able to run it on a cluster with --min-feature-frequency 40, which is higher than I would like but will be ok for now.

If you can identify the problem please let me know! I appreciate the responses and effort.

nvpatin commented 2 years ago

To follow up on this puzzle: I thought the problem might be that one set of samples (~1/3 of the whole data set) is extremely sparse in its composition, with lots of zeros and otherwise generally low count values compared to the other 2/3 of samples. So I converted the count table to a presence/absence matrix and tried to run it again, but I STILL got a segfault with the default parameters in standalone deicode! I attached the presence/absence matrix as a text file here. It won't let me upload the BIOM table but I can email it to you if you would like. metaflye_hybrid_illumina_dfs-presabs.txt https://github.com/biocore/DEICODE/files/9729350/metaflye_hybrid_illumina_dfs-presabs.txt

mortonjt commented 2 years ago

This won't work with presence/absence matrices (zeros are treated as missing).
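
One way to see why: the rclr preprocessing step takes the log of the observed (nonzero) entries and centers each sample by the mean of those logs. In a presence/absence table every observed value is 1, so every log is 0 and there is no variation left to factorize. The sketch below is a simplified NumPy rendering of that idea, not DEICODE's implementation; the function name `rclr_sketch` and the toy matrices are hypothetical.

```python
import numpy as np


def rclr_sketch(counts: np.ndarray) -> np.ndarray:
    """Log-transform nonzero entries and center each row by their mean log.

    Zeros are treated as missing and returned as NaN.
    Assumes every row has at least one nonzero entry.
    """
    counts = np.asarray(counts, dtype=float)
    observed = counts > 0
    # Replace zeros with 1 before the log to avoid warnings, then mask them as NaN.
    logs = np.where(observed, np.log(np.where(observed, counts, 1.0)), np.nan)
    row_means = np.nanmean(logs, axis=1, keepdims=True)
    return logs - row_means


# Hypothetical toy table: rows are samples, columns are features.
counts = np.array([[10, 0, 2], [0, 4, 16], [1, 8, 0]])
presence = (counts > 0).astype(int)

print(rclr_sketch(counts))    # observed entries vary within each sample
print(rclr_sketch(presence))  # every observed entry is 0.0
```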

nvpatin commented 2 years ago

Ok, so would you agree that is probably the original source of the problem? Seems like I may need to transform the data somehow (maybe just add a count of 1 to every value).