Open nvpatin opened 2 years ago
Hmm, weird. I don't think it is a memory issue; more likely there is a software dependency issue.
Could you provide the QIIME2 version and the conda environment? You can post the output of 'conda env export'.
On Fri, Sep 30, 2022 at 7:01 PM Nastassia Patin @.***> wrote:
I am receiving a "segmentation fault" error when I try to run DEICODE auto-rpca. I've tried running it both in QIIME2 and standalone, with the standalone as the most recent version installed with conda ("conda install -c conda-forge deicode") both locally and on an HPC system, with the same outcome. My full command is
deicode auto-rpca --in-biom metaflye_hybrid_dfs.biom --output-dir metaflye-hybrid-deicode
And the error message is:
/var/spool/slurmd/job116796/slurm_script: line 22: 82401 Segmentation fault
It's not very informative. On the HPC I tried increasing the memory allocation to 500GB (10 nodes) with no success. I attached the tab-separated text version of my BIOM file here in case that is helpful.
Any suggestions are greatly appreciated.
metaflye_hybrid_dfs.txt https://github.com/biocore/DEICODE/files/9688049/metaflye_hybrid_dfs.txt
Thanks for the quick response! Here is the output of 'conda env export': name: qiime2-2022.2 channels:
I resolved this problem by changing my parameters for --p-min-feature-count and --p-min-sample-count in QIIME2 (I changed both to 1). I'm not sure how it could be resolved in the standalone DEICODE. This issue can be considered resolved now.
Hi again, I am getting another segmentation fault after I increased the size of my data set. Once again it occurs with both the standalone and QIIME2 plug-in for DEICODE. I tried changing several parameters; here are my most recent.
In QIIME2:
qiime deicode rpca --i-table metaflye_hybrid_illumina_dfs.qza --p-min-feature-count 1 --p-min-sample-count 500 --o-biplot metaflye_hybrid_illumina_deicode_ordination.qza --o-distance-matrix metaflye_hybrid_illumina_deicode_distance.qza --p-max-iterations 1 --p-n-components 2
Standalone DEICODE:
deicode auto-rpca --in-biom metaflye_hybrid_illumina_dfs.biom --output-dir Lasker2019_ORFs_metaflye_illumina_hybrid-deicode
I attached the tab-separated text file of the table that I converted to BIOM and QIIME2 .qza formats. It is not a standard amplicon data set, rather it reflects the number of taxonomically annotated ORFs for a set of shotgun metagenomes. I did successfully use a similar but smaller file earlier. metaflye_hybrid_illumina_dfs.txt
Hi, it may be worthwhile to compute feature_count (i.e. the number of samples a given feature is observed in) as well as sample_count (the number of OTUs observed in a given sample):
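import pandas as pd
import qiime2

# Load the feature table; rows are samples, columns are features
df = qiime2.Artifact.load('metaflye_hybrid_illumina_dfs.qza').view(pd.DataFrame)
# Number of samples in which each feature is observed
feature_count = (df > 0).sum(axis=0)
# Number of features observed in each sample
sample_count = (df > 0).sum(axis=1)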
If you have any microbes that aren't observed in any samples, or samples with no microbes, that will lead to a segfault.
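The check described above can be sketched on a small table. This is only an illustration: the toy data and the names empty_features, empty_samples, and cleaned are made up for the example, and the table orientation (samples as rows) is assumed to match the DataFrame view used above.

```python
import pandas as pd

# Toy table (hypothetical data): rows are samples, columns are features.
df = pd.DataFrame(
    {"orf_a": [5, 0, 3], "orf_b": [0, 0, 0], "orf_c": [2, 0, 7]},
    index=["s1", "s2", "s3"],
)

feature_count = (df > 0).sum(axis=0)  # samples in which each feature occurs
sample_count = (df > 0).sum(axis=1)   # features observed in each sample

# Features seen in no sample, and samples with no features at all.
empty_features = feature_count[feature_count == 0].index.tolist()
empty_samples = sample_count[sample_count == 0].index.tolist()

# Drop the all-zero rows and columns before running RPCA.
cleaned = df.loc[sample_count > 0, feature_count > 0]
print(empty_features, empty_samples, cleaned.shape)
```

If both lists come back empty, as in the minimum values reported later in this thread, all-zero rows or columns are probably not the cause.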
I did as you suggested. My min(feature_count) is 1 and my min(sample_count) is 121.
Are there ways to modify the default parameters in the standalone DEICODE command? I know how to do it in QIIME2 but I would like to try similar modifications in the standalone tool.
Yes and thanks for posting the issue and the data. I was able to replicate the issue. I will need to dig into the error because it is not immediately apparent why it is happening. I have run much bigger and more square datasets without an issue, so I am not convinced it is entirely memory related but it does seem related to the feature space size.
A temporary fix is to reduce the feature space with a frequency filter. On my laptop I had to use a pretty extreme filter, removing any features present in fewer than 50% of the samples. On a compute cluster you could probably use a lower threshold (e.g. 10).
The following commands both worked on my laptop:
deicode rpca --in-biom metaflye_hybrid_illumina_dfs.biom --output-dir metaflye_hybrid_illumina_dfs_test --min-feature-frequency 50
qiime deicode rpca --i-table metaflye_hybrid_illumina_dfs.qza --output-dir metaflye_hybrid_illumina_dfs_test --p-min-feature-frequency 50
and here are all the parameters in the standalone command:
Usage: deicode rpca [OPTIONS]

  Runs RPCA with an rclr preprocessing step.

Options:
  --in-biom TEXT                  Input table in biom format.  [required]
  --output-dir TEXT               Location of output files.  [required]
  --n_components INTEGER          The underlying low-rank structure. The input
                                  can be an integer (suggested: 1 < rank < 10)
                                  [minimum 2]. Note: as the rank increases the
                                  runtime will increase dramatically.
                                  [default: 3]
  --min-sample-count INTEGER      Minimum sum cutoff of sample across all
                                  features. The value can be at minimum zero
                                  and must be an whole integer. It is
                                  suggested to be greater than or equal to
                                  500.  [default: 500]
  --min-feature-count INTEGER     Minimum sum cutoff of features across all
                                  samples. The value can be at minimum zero
                                  and must be an whole integer  [default: 10]
  --min-feature-frequency INTEGER
                                  Minimum percentage of samples a feature must
                                  appear with a value greater than zero. This
                                  value can range from 0 to 100 with decimal
                                  values allowed.  [default: 0]
  --max_iterations INTEGER        The number of iterations to optimize the
                                  solution (suggested to be below 100; beware
                                  of overfitting) [minimum 1]  [default: 5]
  --help                          Show this message and exit.
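To preview how many features a given --min-feature-frequency value would keep before launching a run, the filter can be approximated in pandas. This is a rough sketch, not DEICODE's own code: filter_by_frequency is a hypothetical helper, the toy table is made up, and whether DEICODE treats the threshold as inclusive is an assumption here.

```python
import pandas as pd


def filter_by_frequency(df: pd.DataFrame, min_feature_frequency: float) -> pd.DataFrame:
    """Keep features (columns) present in at least the given % of samples (rows)."""
    # Percentage of samples in which each feature has a nonzero value.
    frequency = (df > 0).mean(axis=0) * 100
    # Assumption: the threshold is inclusive; DEICODE itself may differ.
    return df.loc[:, frequency >= min_feature_frequency]


# Toy table (hypothetical data): 4 samples x 3 features.
toy = pd.DataFrame(
    {"orf_a": [5, 1, 3, 2], "orf_b": [0, 0, 0, 4], "orf_c": [2, 0, 7, 1]},
    index=["s1", "s2", "s3", "s4"],
)

# orf_a is in 100% of samples, orf_b in 25%, orf_c in 75%.
filtered = filter_by_frequency(toy, 50)
print(list(filtered.columns))
```

Running the same helper over a range of thresholds on the real table would show how quickly the feature space shrinks, which may help pick the least aggressive value that still avoids the segfault.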
Thanks for this information. I was able to run it on a cluster with --min-feature-frequency 40, which is higher than I would like but will be ok for now.
If you can identify the problem please let me know! I appreciate the responses and effort.
To follow up on this puzzle: I thought the problem might be that one set of samples (~1/3 of the whole data set) are extremely sparse in their composition, with lots of zeros and otherwise generally low count values compared to the other 2/3 of samples. So I converted the count table to a presence/absence matrix and tried to run it again, but I STILL got a segfault with the default parameters in standalone deicode! I attached the presence/absence matrix as a text file here. It won't let me upload the BIOM table but I can email it to you if you would like. metaflye_hybrid_illumina_dfs-presabs.txt
This won't work with presence/absence matrices (zeros are treated as missing).
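One way to see why a presence/absence table carries no signal here is to apply a minimal version of the rclr transform, which takes the log of the observed (nonzero) values and centers each sample by the mean of those logs. The function rclr_sketch below is a simplified illustration written for this thread, not DEICODE's implementation, and the toy counts are made up.

```python
import numpy as np


def rclr_sketch(counts) -> np.ndarray:
    """Log of observed (nonzero) values, centered per sample; zeros stay NaN."""
    counts = np.asarray(counts, dtype=float)
    logged = np.full(counts.shape, np.nan)
    observed = counts > 0
    logged[observed] = np.log(counts[observed])
    # Center each sample (row) by the mean log over its observed entries only.
    return logged - np.nanmean(logged, axis=1, keepdims=True)


# Toy table (hypothetical data): 2 samples x 3 features.
counts = np.array([[10, 0, 100], [1, 20, 0]])
presence_absence = (counts > 0).astype(int)

print(rclr_sketch(counts))            # observed entries differ from each other
print(rclr_sketch(presence_absence))  # every observed entry is 0.0
```

With only ones and zeros, log(1) is 0 for every observed entry and the zeros are missing, so the transformed table has no variation left to decompose.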
Ok, so would you agree that is probably the original source of the problem? Seems like I may need to transform the data somehow (maybe just add a count of 1 to every value).