OmarOakheart / nPhase

Ploidy agnostic phasing pipeline and algorithm
GNU General Public License v3.0

No plots and phased FastQ generated, AttributeError: 'DataFrame' object has no attribute 'append' #25

Closed by mdondrup 9 months ago

mdondrup commented 10 months ago

I am trying to run nPhase on long-read and short-read data from a tetraploid yeast strain. However, after running nphase pipeline or nphase algorithm, no plots are generated in Phased/Plots and no files appear in Phased/FastQ, although the .tsv files are generated.

Command:

    nphase algorithm --sampleName Kveik_sample_6 --longReads sample6_cleaned_trimmed.fastq.gz --contextDepth \
        nphase_out/Kveik_sample_6/Overlaps/Kveik_sample_6.contextDepths.tsv --processedLongReads \
        nphase_out/Kveik_sample_6/VariantCalls/longReads/Kveik_sample_6.hetPositions.SNPxLongReads.validated.tsv \
        --output ./nphase_out --threads 20 --reference ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna

There is no error message in the log file, but an error is printed on STDERR. Here is the output of the run:

Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (163818 sequences)
334 clusters
Initializing cachedCluster
Filling cachedCluster with similarity information
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:727: PlotnineWarning: Saving 18 x 10 in image.
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:730: PlotnineWarning: Filename: ./nphase_out/Kveik_sample_6/Phased/Plots/Kveik_sample_6_0.1_0.01_0.05_0_phasedVis.svg
Preparing initial similarity index
Starting clustering loop (168228 sequences)
187 clusters

Phased files can be found at ./nphase_out/Kveik_sample_6/Phased
The *_variants.tsv file contains information on the consensus heterozygous variants present in each predicted haplotig.
The *_clusterReadNames.tsv file contains information on the reads which comprise each cluster.
Traceback (most recent call last):
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 589, in main
    nPhaseAlgorithm(args)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 256, in nPhaseAlgorithm
    nPhaseFunctions.generatePhasingVis(simpleOutPath,datavisPath)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 594, in generatePhasingVis
    ggsave(g,filename=outputSVG,width=18,height=10)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 761, in ggsave
    return plot.save(*arg, **kwargs)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 750, in save
    raise err
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 747, in save
    _save()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 734, in _save
    fig = figure[0] = self.draw()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 181, in draw
    return self._draw(return_ggplot)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 188, in _draw
    self._build()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 284, in _build
    layout.setup(layers, self)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/layout.py", line 64, in setup
    layer.data = self.facet.map(ldata, self.layout)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/facet_wrap.py", line 136, in map
    keys = join_keys(facet_vals, layout, self.vars)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/utils.py", line 370, in join_keys
    joint = x[by].append(y[by], ignore_index=True)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'

nPhase was installed using conda on Ubuntu, according to the instructions.

Python and pandas version:

Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'2.0.3'

Linux ubuntu-compute-2 5.-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

OmarOakheart commented 10 months ago

Thank you for pointing out this issue. It looks like pandas 2.0 and later breaks the version of plotnine that nPhase uses to generate the plots. The FastQ files are generated after the plots because some datasets are large enough to cause an out-of-memory error, and I wanted people to at least have some plots in that case, so they can judge whether it is worth running again with more memory.
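For context, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, which is why the plotnine call in the traceback fails; pd.concat is the documented replacement. A minimal sketch with made-up data (the column name chrom and its values are purely illustrative and are not taken from nPhase or plotnine):

```python
import pandas as pd

# Two small frames standing in for the keys being joined (illustrative data only).
x = pd.DataFrame({"chrom": ["chrI", "chrII"]})
y = pd.DataFrame({"chrom": ["chrII", "chrIII"]})

# pandas < 2.0 allowed: joint = x.append(y, ignore_index=True)
# pd.concat gives the same result and works on both old and new pandas.
joint = pd.concat([x, y], ignore_index=True)
print(joint)
```

This only illustrates the API change; it is not the patch that plotnine itself applied.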

Can you try downgrading pandas to 1.5.3 and running nPhase again on the test dataset in https://github.com/OmarOakheart/nPhase/tree/master/example ? It should run very quickly since it's a small dataset.

mdondrup commented 10 months ago

Thank you for the reply. I installed pandas 1.5.3 and now I am getting a different error using the example data:


Phased files can be found at nphase_example_out/Example1/Phased
The *_variants.tsv file contains information on the consensus heterozygous variants present in each predicted haplotig.
The *_clusterReadNames.tsv file contains information on the reads which comprise each cluster.
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:727: PlotnineWarning: Saving 18 x 10 in image.
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:730: PlotnineWarning: Filename: nphase_example_out/Example1/Phased/Plots/Example1_0.1_0.01_0.05_0_phasedVis.svg
Traceback (most recent call last):
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 587, in main
    nPhasePipeline(args)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 194, in nPhasePipeline
    nPhaseFunctions.generatePhasingVis(simpleOutPath,datavisPath)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 594, in generatePhasingVis
    ggsave(g,filename=outputSVG,width=18,height=10)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 761, in ggsave
    return plot.save(*arg, **kwargs)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 750, in save
    raise err
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 747, in save
    _save()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 734, in _save
    fig = figure[0] = self.draw()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 181, in draw
    return self._draw(return_ggplot)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 188, in _draw
    self._build()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 284, in _build
    layout.setup(layers, self)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/layout.py", line 64, in setup
    layer.data = self.facet.map(ldata, self.layout)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/facet_wrap.py", line 136, in map
    keys = join_keys(facet_vals, layout, self.vars)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/utils.py", line 372, in join_keys
    joint = pd.concat([x[by], pd.DataFrame([y[by]])], ignore_index=True)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/pandas/core/frame.py", line 762, in __init__
    mgr = ndarray_to_mgr(
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 329, in ndarray_to_mgr
    values = _prep_ndarraylike(values, copy=copy_on_sanitize)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 583, in _prep_ndarraylike
    raise ValueError(f"Must pass 2-d input. shape={values.shape}")
ValueError: Must pass 2-d input. shape=(1, 2, 1)

OmarOakheart commented 10 months ago

I received a different error report today from someone who pointed out that conda takes forever to run. To resolve that issue, I recommend using mamba (https://github.com/conda-forge/miniforge) instead of conda to create the environment. I then checked whether that worked, and it seems that some of the instructions in the README were outdated. I think the matplotlib version I previously recommended downgrading to may have caused your issues.

I believe that if you use mamba and create a fresh environment, the test data should run flawlessly. I apologize for the inconvenience, please let me know if you have any trouble after that. So, to be clear,

micromamba create -n polyploidPhasing -c oakheart nphase -c bioconda
micromamba activate polyploidPhasing

should now be sufficient to install nPhase; there is no need to downgrade pandas, matplotlib, or anything else.

mdondrup commented 10 months ago

I created a fresh environment using the commands above. We are using the setup from the Biostar handbook, so we are already using micromamba. This installs:

ModuleNotFoundError: No module named 'matplotlib._contour' (compile-time error)

I then tested the other combinations:

ValueError: Must pass 2-d input. shape=(1, 2, 1) (Runtime error following clustering)

ValueError: Must pass 2-d input. shape=(1, 2, 1)

ModuleNotFoundError: No module named 'matplotlib._contour'

mdondrup commented 10 months ago

Dear @OmarOakheart, I suspect there is a version conflict between Python packages somewhere. Could you run the following from within a Python environment where the test is working?

import pkg_resources
installed_packages = pkg_resources.working_set
installed_packages_list = sorted(["%s==%s" % (i.key, i.version)
   for i in installed_packages])
print(installed_packages_list)
['backports.zoneinfo==0.2.1', 'certifi==2023.11.17', 'contourpy==1.1.1', 'cycler==0.12.1', 'descartes==1.1.0', 
'fonttools==4.46.0', 'importlib-resources==6.1.1', 'kiwisolver==1.4.5', 'matplotlib==3.7.3', 'mizani==0.9.3', 'nphase==1.2.0', 
'numpy==1.24.4', 'olefile==0.47', 'packaging==23.2', 'pandas==2.0.3', 'patsy==0.5.4', 'pillow==8.2.0', 'pip==23.3.1', 
'plotnine==0.7.1', 'pyparsing==3.1.1', 'pyqt5-sip==4.19.18', 'pyqt5==5.12.3', 'pyqtchart==5.12', 'pyqtwebengine==5.12.1', 
'python-dateutil==2.8.2', 'pytz==2023.3.post1', 'scipy==1.9.3', 'setuptools==68.2.2', 'six==1.16.0', 
'sortedcontainers==2.4.0', 'statsmodels==0.14.0', 'tornado==6.3.3', 'tzdata==2023.3', 'unicodedata2==15.1.0', 
'wheel==0.42.0', 'zipp==3.17.0']

mdondrup commented 10 months ago

Follow-up: I tried installing with pip instead, and that worked. I created a conda environment for the dependencies:

micromamba create -n nphasegit python=3.8 bwa gatk=4.3 samtools=1.9 ngmlr
micromamba activate nphasegit
pip install -U nPhase

The packages with version differences are:

It might help to include or update the version requirements in the conda package according to these.

Best regards, Michael

['backports.zoneinfo==0.2.1', 'contourpy==1.1.1', 'cycler==0.12.1', 'fonttools==4.46.0', 'importlib-resources==6.1.1', 'kiwisolver==1.4.5', 'matplotlib==3.7.4', 'mizani==0.9.3', 'nphase==1.2.0', 'numpy==1.24.4', 'packaging==23.2', 'pandas==2.0.3', 'patsy==0.5.4', 'pillow==10.1.0', 'pip==23.3.1', 'plotnine==0.12.4', 'pyparsing==3.1.1', 'python-dateutil==2.8.2', 'pytz==2023.3.post1', 'scipy==1.10.1', 'setuptools==68.2.2', 'six==1.16.0', 'sortedcontainers==2.4.0', 'statsmodels==0.14.0', 'tzdata==2023.3', 'wheel==0.42.0', 'zipp==3.17.0']
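A minimal sketch of how two such package lists can be compared programmatically. The helper names parse_pins and version_differences are hypothetical, and the inputs are excerpts of the first and second lists posted in this thread, not the full environments:

```python
def parse_pins(pins):
    """Turn ['name==version', ...] into a {name: version} dict."""
    return dict(pin.split("==", 1) for pin in pins)


def version_differences(env_a, env_b):
    """Return {name: (version_a, version_b)} for packages in both lists whose versions differ."""
    a, b = parse_pins(env_a), parse_pins(env_b)
    return {
        name: (a[name], b[name])
        for name in sorted(a.keys() & b.keys())
        if a[name] != b[name]
    }


# Excerpts of the two lists posted in this thread.
first_list = ["matplotlib==3.7.3", "pandas==2.0.3", "pillow==8.2.0",
              "plotnine==0.7.1", "scipy==1.9.3"]
second_list = ["matplotlib==3.7.4", "pandas==2.0.3", "pillow==10.1.0",
               "plotnine==0.12.4", "scipy==1.10.1"]

for name, (old, new) in version_differences(first_list, second_list).items():
    print(f"{name}: {old} -> {new}")
```

For these excerpts, the sketch reports differences for matplotlib, pillow, plotnine, and scipy, while pandas is identical in both lists.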

OmarOakheart commented 10 months ago

Hello,

Sorry for the delay. You were right: there was a package incompatibility caused by nPhase 1.2.0 requiring a specific version of plotnine, a pin that is no longer necessary. I was able to replicate your issue (previously, when installing with micromamba, I didn't realize it had installed an earlier version of nPhase).

I've uploaded nPhase 1.2.1, which does not have this requirement, and it now installs properly. I've also tested that it produces plots and FastQ files on my new installation.

I think that should resolve the issue fully. You can make sure you're installing the correct version by running:

micromamba create -n polyploidPhasing -c oakheart nphase=1.2.1 -c bioconda
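After activating the environment, the installed versions can be checked from Python with the standard-library importlib.metadata module (available from Python 3.8). This is a minimal sketch; the helper name report_versions is hypothetical, and the distribution names are assumed to match those in the package lists posted earlier in this thread:

```python
from importlib import metadata


def report_versions(names):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names assumed from the lists posted in this thread.
print(report_versions(["nphase", "plotnine", "pandas", "matplotlib"]))
```

A value of None for nphase would suggest the interpreter being used is not the one inside the polyploidPhasing environment.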

mdondrup commented 9 months ago

Thank you for the fix. I have tested it on the example data and I can confirm that plots and phased FastQ files are produced, with a few minor warnings:

/home/ubuntu/micromamba/envs/polyploidPhasing-2/lib/python3.8/site-packages/plotnine/themes/themeable.py:1902: FutureWarning: You no longer need to use subplots_adjust to make space for the legend or text around the panels. This paramater will be removed in a future version. You can still use 'plot_margin' 'panel_spacing' for your other spacing needs.

OmarOakheart commented 9 months ago

Thank you for pointing out the warning. I'm glad the issue was successfully resolved. Don't hesitate to contact me if you need any assistance running nPhase on your data.

Best, Omar