dieterich-lab / rp-bp

Rp-Bp is a Bayesian approach to predict, at base-pair resolution, ribosome occupancy and translation.
MIT License

Trouble parsing refseq annotations #83

Closed by bmmalone 6 years ago

bmmalone commented 6 years ago

From Charlie:

I am having trouble running it on my data (human genome, RefSeq genes in GTF annotation, de novo genes in GTF annotation, Ensembl rRNA). I've confirmed in IGV that the GTFs look appropriately formatted.

slurm-fail-one.err.txt

csoeder commented 6 years ago

The source for my refSeq GTF file was the UCSC genome browser (hg19): https://genome.ucsc.edu/cgi-bin/hgTracks

I'm going to try running it on the GRCh37 GTF from ensembl.org: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens

bmmalone commented 6 years ago

Great, thanks. I will also try the refseq annotations; hopefully that at least narrows things down to a parsing problem.

csoeder commented 6 years ago

I got what looks like the same error using the Ensembl GTF linked above.

bmmalone commented 6 years ago

Sorry for the slow followup.

I am now trying things with these files:

Also, if at all possible, I recommend updating to the 38 build; it is several years old, so should be considered stable.

bmmalone commented 6 years ago

Hmm... I'm having trouble reproducing those errors. I used the annotations linked above using the 37 build and ensembl 75 annotations, and everything finished smoothly.

Can you please post the output of pip3 freeze so we can compare versions of the other libraries? For reference, here is the output from the virtual env I am using:

appdirs==1.4.3
asn1crypto==0.23.0
bcrypt==3.1.4
bio-utils==0.2.4
biopython==1.70
certifi==2017.7.27.1
cffi==1.11.2
chardet==3.0.4
cryptography==2.1.3
cycler==0.10.0
Cython==0.27.3
decorator==4.1.2
docopt==0.6.2
et-xmlfile==1.0.1
fastparquet==0.1.3
graphviz==0.8.1
idna==2.6
jdcal==1.3
joblib==0.11
llvmlite==0.20.0
matplotlib==2.1.0
matplotlib-venn==0.11.5
misc==0.2.5
more-itertools==3.2.0
mygene==3.0.0
networkx==2.0
numba==0.35.0
numexpr==2.6.4
numpy==1.13.3
openpyxl==2.4.9
pandas==0.21.0
paramiko==2.3.1
patsy==0.4.1
psutil==5.4.0
pyasn1==0.3.7
pycparser==2.18
pydot==1.2.3
pyfasta==0.5.2
PyNaCl==1.2.0
pyparsing==2.2.0
pysam==0.12.0.1
pystan==2.16.0.0
python-dateutil==2.6.1
pytz==2017.3
PyYAML==3.12
requests==2.18.4
riboutils==0.2.5
rpbp==1.1.10
scikit-learn==0.19.1
scipy==1.0.0
seaborn==0.8.1
six==1.11.0
sklearn==0.0
spur==0.3.20
statsmodels==0.8.0
tables==3.4.2
thrift==0.10.0
tqdm==4.19.4
urllib3==1.22
xlrd==1.1.0

In particular, the versions of pandas, numpy and possibly bio-utils, riboutils, rpbp and misc could affect those steps in the pipeline.

Otherwise, there could be something with the /pine/scr/c/s/csoeder/hSap_rpbp_test/genome_index/hg19Test.annotated.bed.gz. The script breaks pretty much immediately, so it would be something in the first lines. If possible, could you please also post that file, or at least the beginning of it?
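
The version comparison requested above can be done by eye, but a small script makes the differences harder to miss. The following is only a sketch: `parse_freeze`, `diff_freezes` and `WATCHED` are hypothetical names, the `reference` string is taken from the freeze output above, and the `other` string is made up purely for illustration.

```python
def parse_freeze(text):
    """Map package name (lower-cased) to version from `pip freeze` text.

    Splitting on whitespace handles both one-package-per-line output
    and output that has been pasted onto a single line.
    """
    versions = {}
    for token in text.split():
        if "==" in token:
            name, version = token.split("==", 1)
            versions[name.lower()] = version
    return versions


def diff_freezes(mine, theirs, packages):
    """Return {package: (my_version, their_version)} for packages that differ."""
    a, b = parse_freeze(mine), parse_freeze(theirs)
    return {p: (a.get(p), b.get(p)) for p in packages if a.get(p) != b.get(p)}


# Packages named above as most likely to affect these pipeline steps.
WATCHED = ["pandas", "numpy", "bio-utils", "riboutils", "rpbp", "misc"]

# Versions copied from the reference environment listed above.
reference = ("bio-utils==0.2.4 misc==0.2.5 numpy==1.13.3 "
             "pandas==0.21.0 riboutils==0.2.5 rpbp==1.1.10")
# Illustrative only: NOT anyone's real environment.
other = "bio-utils==0.2.4 misc==0.2.5 numpy==1.12.0 pandas==0.21.0 riboutils==0.2.5"

print(diff_freezes(reference, other, WATCHED))
```

A missing package shows up with `None` as its version, so an absent install is reported alongside a mismatched one.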

csoeder commented 6 years ago

Here's the pip3 freeze output from within my virtual environment; there are some differences in the library versions you pointed out:

(virtEnv) [csoeder@c0938 ~]$ pip3 freeze
alabaster==0.7.10
alembic==0.8.10
appdirs==1.4.0
Babel==2.4.0
backports.csv==1.0.4
bio-utils==0.2.3
biopython==1.65
brainx==0.1.dev0
bz2file==0.98
certifi==2017.7.27.1
cffi==1.9.1
chardet==3.0.4
click==6.7
cryptography==1.7.2
cycler==0.10.0
Cython==0.25.2
docopt==0.6.2
docutils==0.13.1
et-xmlfile==1.0.1
future==0.16.0
gdbn==0.1
gnumpy==0.2
graphviz==0.8
idna==2.2
imagesize==0.7.1
ipython==4.0.1
ipython-genutils==0.1.0
jdcal==1.3
Jinja2==2.9.5
joblib==0.11
jupyterhub==0.7.2
Lasagne==0.1
llvmlite==0.20.0
matplotlib==2.0.2
matplotlib-venn==0.11.5
misc==0.1.6
mpmath==0.19
multiqc==0.9
mygene==3.0.0
mysql-connector-python==2.0.4
nibabel==2.1.0
nipy==0.4.1
nltk==3.2.2
nolearn==0.6.0
nose==1.3.7
nose-parameterized==0.5.0
numba==0.35.0
numpy==1.12.0
openpyxl==2.4.8
pamela==0.3.0
pandas==0.20.3
patsy==0.4.1
pexpect==4.2.1
pickleshare==0.7.4
protobuf==3.3.0
psutil==5.3.1
ptyprocess==0.5.1
py==1.4.32
pyasn1==0.2.2
pycparser==2.17
pycuda==2016.1.2
pyfasta==0.5.2
Pygments==2.2.0
pyOpenSSL==16.2.0
pyparsing==2.2.0
pysam==0.12.0.1
pystan==2.16.0.0
pytest==3.0.6
python-dateutil==2.6.1
python-editor==1.0.3
pytools==2016.2.6
pytz==2017.2
PyYAML==3.12
requests==2.13.0
riboutils==0.2.4
rpbp==1.1.9
rpy2==2.8.5
scikit-learn==0.19.0
scipy==0.18.1
seaborn==0.8.1
simplegeneric==0.8.1
simplejson==3.10.0
six==1.10.0
snowballstemmer==1.2.1
Sphinx==1.6.2
sphinxcontrib-websupport==1.0.1
SQLAlchemy==1.1.5
statsmodels==0.8.0
sympy==1.0
synapseclient==1.6.1
tabulate==0.7.7
Theano==0.8.2
tornado==4.4.2
tqdm==4.19.1.post1
traitlets==4.3.1
urllib3==1.22
virtualenv==15.1.0
xlrd==1.1.0
xlsx2csv==0.7.3

Here is the annotated.bed file:

$ head -n 20 hg19Test.annotated.bed

seqname #start #end #id #score #strand #thick_start #thick_end #color #num_exons #exon_lengths #exon_genomic_relative_starts
chr1 11868 102519301 NR_148357 0 + -1 -1 0 6 359,109,1142,1142,109,359 0,744,1352,102504939,102506580,102507074
chr1 11873 14409 NR_046018 0 + -1 -1 0 3 354,109,1189 0,739,1347
chr1 14361 29370 NR_024540 0 - -1 -1 0 11 468,69,152,159,198,136,137,147,99,154,50 0,608,1434,2245,2496,2871,3244,3553,3906,10376,14959
chr1 17368 102513794 NR_106918 0 - -1 -1 0 3 68,68,68 0,49683,102496358
chr1 17368 102513794 NR_107062 0 - -1 -1 0 3 68,68,68 0,49683,102496358
chr1 17368 102513794 NR_107063 0 - -1 -1 0 3 68,68,68 0,49683,102496358
chr1 17368 102513794 NR_128720 0 - -1 -1 0 3 68,68,68 0,49683,102496358
chr1 34610 77690 NR_026818 0 - -1 -1 0 6 564,205,361,564,205,361 0,666,1110,41609,42275,42719
chr1 34610 77690 NR_026820 0 - -1 -1 0 6 564,205,361,564,205,361 0,666,1110,41609,42275,42719
chr1 69090 70008 NM_001005484 0 + 69090 70005 0 1 918 0
chr1 134772 140566 NR_039983 0 - -1 -1 0 3 4924,58,492 0,5017,5302
chr1 323891 180755196 NR_028322 0 + -1 -1 0 6 169,58,4143,169,58,4143 0,396,547,180426615,180427011,180427162
chr1 323891 180755196 NR_028325 0 + -1 -1 0 6 169,58,4143,169,58,4143 0,396,547,180426615,180427011,180427162
chr1 323891 180755196 NR_028327 0 + -1 -1 0 8 169,58,2500,1546,169,58,2500,1546 0,396,547,3144,180426615,180427011,180427162,180429759
chr1 367658 180795226 NM_001005221 0 + 367658 180795223 0 2 939,939 0,180426629
chr1 367658 180795226 NM_001005224 0 + 367658 180795223 0 2 939,939 0,180426629
chr1 367658 180795226 NM_001005277 0 + 367658 180795223 0 2 939,939 0,180426629
chr1 562759 564389 NR_125957 0 - -1 -1 0 3 444,263,91 0,581,1539
chr1 567704 567793 NR_106781 0 - -1 -1 0 1 89 0
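
Records like the ones above can be sanity-checked before they reach the pipeline. This is a rough sketch, not part of bio_utils or Rp-Bp: `check_bed12_line` and the `max_span` threshold are hypothetical, and it assumes whitespace-separated fields in the column order shown in the header. It checks that the integer fields parse, that the exon count matches both exon lists, that the last exon ends at the record end, and that the record does not span an implausibly long region. Whether any of these checks relates to the error in this thread is not established here.

```python
def check_bed12_line(line, max_span=5_000_000):
    """Return a list of problems found in one whitespace-separated BED12 record.

    `max_span` is an arbitrary, hypothetical cut-off in base pairs.
    """
    fields = line.split()
    if len(fields) != 12:
        return ["expected 12 fields, found {}".format(len(fields))]
    try:
        start, end = int(fields[1]), int(fields[2])
        num_exons = int(fields[9])
        lengths = [int(x) for x in fields[10].split(",")]
        rel_starts = [int(x) for x in fields[11].split(",")]
    except ValueError as err:
        # For example, a trailing comma leaves an empty string in the list.
        return ["non-integer value: {}".format(err)]

    problems = []
    if not (num_exons == len(lengths) == len(rel_starts)):
        problems.append("num_exons does not match the exon lists")
    elif rel_starts[-1] + lengths[-1] != end - start:
        problems.append("last exon does not end at the record end")
    if end - start > max_span:
        problems.append("span of {} bp exceeds max_span".format(end - start))
    return problems


# The first two records from the file excerpt above.
records = [
    "chr1 11868 102519301 NR_148357 0 + -1 -1 0 6 "
    "359,109,1142,1142,109,359 0,744,1352,102504939,102506580,102507074",
    "chr1 11873 14409 NR_046018 0 + -1 -1 0 3 354,109,1189 0,739,1347",
]
for record in records:
    print(record.split()[3], check_bed12_line(record))
```

With the default threshold, the first record is flagged only for its span of roughly 102 Mb, and the second passes every check.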

csoeder commented 6 years ago

I've tried updating the library versions individually, and using your freeze output as a pip install requirement, and can't seem to avoid errors like "Could not find a version that satisfies the requirement bio-utils==0.2.4"

Using the most up-to-date library versions I could manage:

bio-utils==1.0.4
gnumpy==0.2
misc==0.1.6
numpy==1.12.0
pandas==0.21.0
riboutils==0.2.4
rpbp==1.1.9

...I get a different error at an earlier point in the code:

INFO root 2017-11-24 14:20:22,587 : gtf-to-bed12 /proj/cdjones_lab/Genomics_Data_Commons/annotations/homo_sapiens/refSeq_hg19.gtf /pine/scr/c/s/csoeder/hSap_rpbp_test/genome_index/hg19Test.annotated.bed.gz --num-cpus 2 --chr-name-file /pine/scr/c/s/csoeder/hSap_rpbp_test/STAR/chrName.txt --logging-level INFO --stderr-logging-level NOTSET --stdout-logging-level NOTSET --file-logging-level NOTSET
INFO root 2017-11-24 14:20:22,590 : calling
Traceback (most recent call last):
  File "/nas/longleaf/home/csoeder/modules/rp-bp/virtEnv/bin/gtf-to-bed12", line 7, in <module>
    from bio_utils.bio_programs.gtf_to_bed12 import main
ImportError: No module named 'bio_utils.bio_programs'
Traceback (most recent call last):
  File "/var/spool/slurmd/job14077081/slurm_script", line 11, in <module>
    sys.exit(main())
  File "/nas/longleaf/home/csoeder/modules/rp-bp/virtEnv/lib/python3.5/site-packages/rpbp/reference_preprocessing/prepare_rpbp_genome.py", line 192, in main
    get_orfs(config['gtf'], args, config, is_annotated=True, is_de_novo=False)
  File "/nas/longleaf/home/csoeder/modules/rp-bp/virtEnv/lib/python3.5/site-packages/rpbp/reference_preprocessing/prepare_rpbp_genome.py", line 42, in get_orfs
    overwrite=args.overwrite, call=call)
  File "/nas/longleaf/home/csoeder/modules/rp-bp/virtEnv/lib/python3.5/site-packages/misc/shell_utils.py", line 241, in call_if_not_exists
    ret_code = check_call(cmd, call=call, raise_on_error=raise_on_error)
  File "/nas/longleaf/home/csoeder/modules/rp-bp/virtEnv/lib/python3.5/site-packages/misc/shell_utils.py", line 89, in check_call
    return check_call_step(cmd, call=call, raise_on_error=raise_on_error)
  File "/nas/longleaf/home/csoeder/modules/rp-bp/virtEnv/lib/python3.5/site-packages/misc/shell_utils.py", line 72, in check_call_step
    raise subprocess.CalledProcessError(ret_code, cmd)
subprocess.CalledProcessError: Command 'gtf-to-bed12 /proj/cdjones_lab/Genomics_Data_Commons/annotations/homo_sapiens/refSeq_hg19.gtf /pine/scr/c/s/csoeder/hSap_rpbp_test/genome_index/hg19Test.annotated.bed.gz --num-cpus 2 --chr-name-file /pine/scr/c/s/csoeder/hSap_rpbp_test/STAR/chrName.txt --logging-level INFO --stderr-logging-level NOTSET --stdout-logging-level NOTSET --file-logging-level NOTSET' returned non-zero exit status 1

eboileau commented 6 years ago

Hi, I'm not sure if that may help (for part of the problem at least), but I just installed rpbp on my local machine and I had a similar issue with "bio_utils". I had to re-install using "-r requirements.txt --ignore-installed" to force re-installing, then after updating my PYTHONPATH, it seems to be running.

eboileau commented 6 years ago

I guess this was due to a previous failed install of rpbp at a different location, due to issues with pymisc and pybio. So in my case, "bio_utils" was found, but not at the right place. This may or may not be relevant to your case.
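
One way to see whether "bio_utils" is being picked up from an unexpected place is to ask Python where it would load the module from. This is a minimal sketch using only the standard library; `locate` is a hypothetical helper name, and the two module names come from the ImportError earlier in the thread. Run it with the same interpreter (and virtual environment) that runs the pipeline.

```python
import importlib.util


def locate(module_name):
    """Return where Python would load `module_name` from, or None if it cannot be found.

    Note that resolving a dotted name imports its parent package.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package of a dotted name is missing.
        return None
    if spec is None:
        return None
    if spec.origin is not None:
        return spec.origin
    # Namespace packages have no single origin file.
    return list(spec.submodule_search_locations)


for name in ("bio_utils", "bio_utils.bio_programs"):
    print(name, "->", locate(name))
```

If the first name resolves to a path outside the expected virtual environment, or the second prints `None` while the first does not, a different or older "bio_utils" is probably shadowing the intended one.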

csoeder commented 6 years ago

I tried the --ignore-installed option, but got the same error message :(

csoeder commented 6 years ago

Additionally, some of the required library versions seem not to exist at all, including the bio-utils package, which seems to be causing the error:

$ pip install --ignore-installed -r requirements.txt
Processing /nas/longleaf/home/csoeder/modules/rp-bp
[.........]
Collecting bio-utils==0.2.4 (from riboutils->-r requirements.txt (line 2))
Could not find a version that satisfies the requirement bio-utils==0.2.4 (from riboutils->-r requirements.txt (line 2)) (from versions: 0.1.0.0, 0.2.0.0, 0.3.0.0, 0.4.0.0, 0.4.1.0, 0.4.2.0, 0.4.3.0, 0.4.4.0, 0.4.5.0, 0.4.6.0, 0.4.7.0, 0.4.8.0, 0.4.9.0, 0.5.0.0, 0.5.1.0, 0.5.2.0, 0.5.3.0, 0.5.4.0, 0.5.4.1, 0.5.4.2, 0.5.4.3, 0.5.4.4, 0.5.4.5, 0.5.4.6, 0.5.4.7, 0.6.0.0, 0.7.0.0, 0.7.1.0, 0.7.1.1, 0.7.2.0, 0.7.3.0, 0.7.4.0, 0.7.5.0, 0.7.6.0, 0.7.7, 0.7.8, 0.7.9, 0.7.10, 0.7.12, 0.7.13, 0.7.14a1, 0.7.14a2, 0.7.14a3, 0.7.14, 0.7.15, 0.7.16, 0.7.17, 0.7.18, 0.7.19a5, 0.7.19a6, 0.7.19a7, 0.7.19a8, 0.7.19a9, 0.7.19a10, 0.7.19a11, 0.7.19a12, 0.7.19a13, 0.7.19a14, 0.7.19a15, 0.7.19a16, 0.7.19a20, 0.7.19a21, 0.7.19a22, 0.7.19a23, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4)
No matching distribution found for bio-utils==0.2.4 (from riboutils->-r requirements.txt (line 2))

bmmalone commented 6 years ago

For some reason, it is attempting to look at the wrong package. The correct one (which is specified in the requirements.txt file, so I'm not sure why it doesn't find it) is here.

csoeder commented 6 years ago

I've redownloaded & reinstalled the libraries; version numbers are fixed, still getting the error though :(

$ pip3 freeze | grep -e pandas -e numpy -e bio-utils -e riboutils -e rpbp -e misc
bio-utils==0.2.4
gnumpy==0.2
misc==0.2.5
numpy==1.12.0
pandas==0.21.0
riboutils==0.2.5
rpbp==1.1.11

eboileau commented 6 years ago

It looks like the requirements.txt file is not giving you the right versions. They were updated on Friday; in particular, if you are installing rpbp==1.1.11, you should have the following versions:

misc==0.2.5
bio-utils==0.2.5
riboutils==0.2.6

You could also try installing each of these packages separately before making a fresh install of rpbp. If you are using virtual environments, and in particular conda, see #86. You may also have to recompile the Stan models (see #45).

eboileau commented 6 years ago

The correct requirements are:

git+https://github.com/bmmalone/pymisc-utils.git@0.2.5#egg=misc
git+https://github.com/dieterich-lab/pybio-utils.git@0.2.5#egg=bio_utils
git+https://github.com/dieterich-lab/riboseq-utils.git@0.2.6#egg=riboutils

eboileau commented 6 years ago

Apologies for not mentioning this earlier, but version requirements in setup.py for rp-bp were updated in dev branch after merge, so they will be wrong in master. As said earlier, they should be:

external_requirements = [
...
    'misc==0.2.5', # this has to be installed via requirements.txt
    'riboutils==0.2.6', # this, too,
    'bio-utils==0.2.5'  # and me!
]

This will be corrected soon.
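
The pins listed above can be checked against what is actually installed. The following is a sketch under a few assumptions: it uses `importlib.metadata`, which needs Python 3.8 or later (the environment in this thread was Python 3.5, where `pip3 freeze` is the practical alternative), and `find_mismatches` and `EXPECTED` are hypothetical names. A matching version number alone does not prove the package came from the right source, since, as seen earlier in this thread, an unrelated package named bio-utils also exists on the package index.

```python
from importlib import metadata  # Python 3.8+

# Pins quoted in the comment above.
EXPECTED = {"misc": "0.2.5", "riboutils": "0.2.6", "bio-utils": "0.2.5"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {name: (wanted, found)} for distributions not at the wanted version.

    `found` is None when the distribution is not installed.
    `get_version` can be replaced for testing.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print("{}: wanted {}, found {}".format(name, wanted, found))
```

The script prints nothing when every pin matches.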

csoeder commented 6 years ago

Ok, I've tried this a few different ways, including installing the required libraries before starting the rpbp installation, as well as editing setup.py to include the above external requirements. Each time I've confirmed the library versions:

$ cat pip3.freeze.now
bio-utils==0.2.5
gnumpy==0.2
misc==0.2.5
numpy==1.12.0
riboutils==0.2.6
rpbp==1.1.11

But (in addition to some deprecation warnings that weren't there before), bed_utils crashes with the same ValueError

eboileau commented 6 years ago

I haven't followed this from the beginning... I understand that the installation was successful in the end, and that you no longer have issues with requirements, but that you still get ValueError: invalid literal for int() with base 10: '' (opening comment on 25 Oct.) at some point when running prepare-rpbp-genome. Is that correct? Thinking quickly, this would occur if there is an empty string, or if you pass a string representation of a float into int. So this could be a problem with the GTF file during the GTF -> BED conversion, since it seems to occur when it's trying to extract the ORF coordinates. I'm not sure what @bmmalone's final comment/conclusion on this was...?
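
The two failure modes described above are easy to reproduce in isolation. This is a small illustration only: `try_int` is a hypothetical helper, and the comma-separated string is a made-up example of how an empty field can arise, not a line taken from the files in this thread.

```python
def try_int(text):
    """Return int(text), or the ValueError message if parsing fails."""
    try:
        return int(text)
    except ValueError as err:
        return str(err)


print(try_int("939"))     # parses normally
print(try_int(""))        # empty string: the message reported in this issue
print(try_int("939.0"))   # float-like string: same error type, different literal

# A trailing comma is one way an empty string can end up in a list of fields.
print("939,939,".split(","))
```

The literal quoted at the end of the error message shows which case applies: `''` for an empty field, or the offending text itself for a float-like value.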

eboileau commented 6 years ago

@csoeder Have you managed to install Rp-Bp? Installation instructions have been updated in dev branch. This should soon be available in the main branch.

csoeder commented 6 years ago

Hi, sorry for the delay. I've tried starting over with the most recent versions of rp-bp and its dependencies and am now hitting a different problem earlier in the pipeline. The software runs fine on the example dataset, but when I try to run prepare-rpbp-genome on human data, STAR crashes, giving this error:

EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=2147483648is too small for your genome SOLUTION: please specify --limitGenomeGenerateRAM not less than 8404712832 and make that much RAM available

I've tried to pass a new parameter to STAR through prepare-rpbp-genome using arguments like this: --star-additional-options "--limitGenomeGenerateRAM 8404712832"

However, this returns the same error message, down to the "limitGenomeGenerateRAM=2147483648", suggesting that prepare-rpbp-genome isn't passing this parameter to STAR. Should I open a new issue about this?

eboileau commented 6 years ago

Hi, --star-additional-options is only used when calling run-all-rpbp-instances, see Running the Rp-Bp pipeline. For STAR genome indexing, the amount of RAM to request can be passed as an option with --mem, see options.


The documentation could be updated to highlight this, but we should also add this option to the preprocessing stages. I'm closing this, but if you encounter any other problems, don't hesitate to open a new issue.
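
For sizing the memory request, the byte counts in the STAR error above convert as follows. This is plain arithmetic rather than anything from Rp-Bp or STAR: `min_gib` is a hypothetical helper, and the exact format the --mem option expects is not stated in this thread, so check the linked options before using the number.

```python
import math

GIB = 2 ** 30  # bytes per gibibyte


def min_gib(required_bytes):
    """Smallest whole number of GiB that covers `required_bytes`."""
    return math.ceil(required_bytes / GIB)


required = 8404712832   # minimum requested in the STAR error message
reported = 2147483648   # limit STAR reported as too small

print(min_gib(required))  # whole GiB needed to satisfy STAR's stated minimum
print(reported / GIB)     # the reported limit, in GiB
```

The result is only a lower bound for genome generation; other steps in the pipeline may need more.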