JulianneDavid / shared-cancer-splicing

Code for reproducing analyses and figures for shared alternative cancer splicing paper
MIT License

Empty directory after running jx_indexer.py in experiment mode #3

Open linagapa opened 4 years ago

linagapa commented 4 years ago

Hi Julianne, After an apparently successful run of jx_indexer.py in index mode, I ran jx_indexer.py in experiment mode which finished with no errors but ended up with an OUTPUT_DIR containing empty tables. When I check the query_log file, it shows this:

INFO:root:starting count_samples function
INFO:root:count samples function complete
INFO:root:starting count_samples function
INFO:root:count samples function complete
INFO:root:starting collect all jxs function
INFO:root:collecting all jxs for Acute_Myeloid_Leukemia:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Kidney_Chromophobe:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Oligoastrocytoma:
INFO:root:0 samples
...

And so on for all cancer types, which explains the empty final files. What could have gone wrong? new_jx_index.db is 266.1 Gb; is this the expected size, or is it expected to contain more data? Thank you very much again for your kind help.

Lina.

JulianneDavid commented 4 years ago

Hi Lina, Thanks for your question. I'm not immediately sure what the problem could be - could you provide a little more information?

1) Most importantly: was the "database_sample_counts" csv file (with timestamp) created? Does it have non-zero sample counts for the various tissue and cancer types?

2) After attempting to collect the cancer-type junctions ("collect all jxs function"), additional experiments should also be run. Does the entire log file continue showing no results even into the later experiments? This would start at

INFO:root:starting neojxs both function
INFO:root:counting total number of unique neojxs in TCGA:

and beyond. If it failed, the log would continue to show 0 samples; if successful, you will see some sql queries and dataframe snippets.
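The second check can be done without scrolling through the whole file. The following is only a minimal sketch: it assumes the query_log uses the line formats quoted in this thread ("collecting ... for LABEL:" followed by "N samples"), and the function name samples_per_label is made up for illustration, not part of jx_indexer.py.

```python
import re

# Patterns are based only on the log excerpts quoted in this thread (assumption).
COLLECT_RE = re.compile(r"collecting (all jxs|neojxs|info) for (\S+):")
SAMPLES_RE = re.compile(r"(\d+) samples")


def samples_per_label(log_text):
    """Map (stage, label) to the sample count logged right after that label."""
    counts = {}
    current = None
    # Splitting on the logging prefix also handles logs pasted as a single line.
    for entry in log_text.split("INFO:root:"):
        entry = entry.strip()
        collect = COLLECT_RE.match(entry)
        if collect:
            current = (collect.group(1), collect.group(2))
            continue
        samples = SAMPLES_RE.match(entry)
        if samples and current is not None:
            counts[current] = int(samples.group(1))
            current = None
    return counts
```

Calling it on the text of the query_log file and filtering for non-zero values shows at a glance whether any cancer type was found. Labels that never log an "N samples" line are simply absent from the result.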

I'm sorry for the difficulty here!

Julianne

linagapa commented 4 years ago

Hi Julianne,

  1. "database_sample_counts" csv file was created but with no content.
  2. Yes, the entire log file continues to show no results, including in the later experiments, so it ends up like this:

INFO:root:collecting neojxs for Uveal_Melanoma:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:count neojxs both complete

Thank you very much for any help!

Lina.

JulianneDavid commented 4 years ago

Hi Lina, I'd like to help resolve this but I think it might be difficult to do from a distance. Would you be willing to do some live debugging together? I've set up a gitter chat room where we could do that, although I need an email address to send an invite. If you're willing to do this, will you suggest a few times that would be convenient for you?

Julianne

JulianneDavid commented 4 years ago

Hi Lina, I understand if you don't want to use the gitter chat to solve this. As an alternative, I've posted some files (in directory mock_files) that mock the GTEx and TCGA files, with only 2 tissue types and a handful of samples. Creating the database with these files should take <2 minutes, and the experiment run, only a few seconds. If you'd like to continue troubleshooting, can I suggest attempting the index and then the experiment run on a mock mini-database?

This should be done with the database stored in a new, mock-specific directory; you do not want to replace the full TCGA/GTEx database.

Specific commands would be (using the regular gencode gtf):

python3 jx_indexer.py -d MOCK_DB_DIR index -c SRP012682.junction_coverage_minimock.tsv -C TCGA.junction_coverage_minimock.tsv -b SRP012682.junction_id_with_transcripts_minimock.bed -B TCGA.junction_id_with_transcripts_minimock.bed -p SRP012682_minimock.tsv -P TCGA_minimock.tsv -s sample_ids_minimock.tsv -g GENCODE_ANNOTATION_GTF

linagapa commented 4 years ago

Hi Julianne, Thank you very much once more for your reply. I appreciate the availability of the mock_files folder, which lets us troubleshoot faster. I have just tried to run index and experiment modes using the mock mini-dataset, and I got the same empty files again. While creating the database in index mode, I got these messages:

shared-cancer-splicing/junction_database/index.py:290: ParserWarning: Both a converter and dtype were specified for column gdc_file_id - only the converter will be used
  converters=uppercase_and_spaces, dtype=str
shared-cancer-splicing/junction_database/index.py:290: ParserWarning: Both a converter and dtype were specified for column gdc_cases.project.name - only the converter will be used
  converters=uppercase_and_spaces, dtype=str
phenotype table creating complete, moving to indexing
coding regions discovered
CDS tree created
splice sites extracted
starting tcga junctions
0th entry, writing
intermediate fill time is 0.008753299713134766
total fill time is 0.02062511444091797
starting tcga junctions
0th entry, writing
intermediate fill time is 0.0021305084228515625
total fill time is 0.05573272705078125
all junctions added to db!
adding db indexes.
first index done
intermediate time is 0.005484819412231445
total time is 0.08250808715820312
second index done
intermediate time is 0.004931926727294922
total time is 0.08748793601989746
third index done
intermediate time is 0.005036354064941406
total time is 0.09259200096130371
fourth index done
intermediate time is 0.005306720733642578
FINAL total time is 0.09796571731567383

And the head of the query_log file is the same as I showed you before:

INFO:root:starting count_samples function
INFO:root:count samples function complete
INFO:root:starting count_samples function
INFO:root:count samples function complete
INFO:root:starting collect all jxs function
INFO:root:collecting all jxs for Thyroid_Carcinoma:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Kidney_Chromophobe:
INFO:root:0 samples

I'm just wondering if you can infer where the problem is with this info.

JulianneDavid commented 4 years ago

Hi Lina, Thanks for running the mini database! I also hope this will help us get to the bottom of the issue quickly. Do you mind answering a few more questions?

The mini database contains only a few sample types (2 cancers and 2 normal tissues), so in this case it's expected that the query_log will be mostly empty. The cancer types present should be Esophageal_Carcinoma and Lung_Adenocarcinoma - if you scroll down in the log, do you see sql query information for those cancer types, or are they the same as all the others? Were the other files generated (output files in all_jxs, non-core-normal_all_jxs_per_sample, non-core-normal_counts_per_sample, etc.) and do they have any information in them? (They should be very small - a few lines each.)
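To answer the question about the output files quickly, something like the following could be run against the output directory. It is a rough sketch, not part of the repository: the function name split_by_emptiness is hypothetical, and it treats only zero-byte files as empty, so a file containing just a header line counts as non-empty.

```python
from pathlib import Path


def split_by_emptiness(output_dir):
    """Return (empty, non_empty) lists of files found anywhere under output_dir."""
    empty, non_empty = [], []
    for path in sorted(Path(output_dir).rglob("*")):
        if not path.is_file():
            continue  # skip subdirectories such as all_jxs
        if path.stat().st_size == 0:
            empty.append(path)
        else:
            non_empty.append(path)
    return empty, non_empty
```

Passing the directory given to -o in the experiment run returns the two lists, which can then be printed or counted.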

Thanks!

Julianne

JulianneDavid commented 4 years ago

Hi Lina, I'm sorry, I see above that you wrote "...I got the same empty files again." I missed that the first time. That answers my second question above, and sounds like possibly the entire log file will be empty also. Could you still confirm about the log file, though, please?

Thanks!

Julianne

linagapa commented 4 years ago

Hi Julianne! OK, the log file shows information for all cancer types, not exclusively for Esophageal_Carcinoma and Lung_Adenocarcinoma. I guess this already tells us that something goes wrong during database generation? Thanks a lot! Lina.

JulianneDavid commented 4 years ago

Hi Lina, I apologize for being unclear: the "experiment" mode will still go through all cancer types even though it's just the mini database, so the results you posted above are expected for the beginning of the file, with 0 cancer samples for most cancer types. But if you scroll down in the query_log file to Esophageal_Carcinoma, you should see something like:

INFO:root:collecting info for Esophageal_Carcinoma:
INFO:root:timestamp is: 03-23-2020_16.39.52
INFO:root:SELECT DISTINCT jx_id, jx_recounts recount_id, tcga_id, norm_jxs   FROM (SELECT jx, jx_id, jx_recounts, sample_phenotype_map.tcga_id     FROM (SELECT jx, jx_sample_map.jx_id, jx_sample_map.recount_id jx_recounts,           jx_sample_map.coverage cov           FROM jx_sample_map INNER JOIN jx_annotation_map ON jx_sample_map.jx_id == jx_annotation_map.jx_id)     INNER JOIN sample_phenotype_map ON jx_recounts == sample_phenotype_map.recount_id       AND (sample_phenotype_map.project_type_label == 'Esophageal_Carcinoma'))   LEFT JOIN (SELECT DISTINCT jx_sample_map.jx_id norm_jxs FROM jx_sample_map     INNER JOIN sample_phenotype_map ON jx_sample_map.recount_id == sample_phenotype_map.recount_id AND sample_phenotype_map.tumor_normal == 1) ON jx_id == norm_jxs WHERE norm_jxs IS NULL;
INFO:root:      jx_id  recount_id       tcga_id norm_jxs
0       167       67859  TCGA-V5-A7RB     None
1       186       67859  TCGA-V5-A7RB     None
2  93606946       68530  TCGA-V5-A7RC     None
3       185       68530  TCGA-V5-A7RC     None neojunctions

Did that show up? Or is it still "0 samples" like the others?
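If it is still "0 samples", it may help to look inside the database directly. The sketch below uses only the table and column names that appear in the logged query above (sample_phenotype_map, project_type_label); the function names are hypothetical, and the rest of the schema is not assumed.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # Names come from sqlite_master itself, so double-quoting them is sufficient.
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}


def labels_with_counts(conn):
    """Count rows per project_type_label in sample_phenotype_map."""
    query = ("SELECT project_type_label, COUNT(*) FROM sample_phenotype_map "
             "GROUP BY project_type_label ORDER BY project_type_label")
    return dict(conn.execute(query).fetchall())
```

Both functions take a connection, for example sqlite3.connect(path_to_db), where the path to the .db file created by index mode is something you would need to fill in (note that sqlite3.connect creates an empty file if the path does not exist). An empty sample_phenotype_map, or labels spelled differently from the 'Esophageal_Carcinoma' in the query's WHERE clause, would be consistent with the "0 samples" lines.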

Also, do you mind providing a bit more information? It would help me to debug if you could send:

  1. The exact commands you used both for index and experiment mode runs, for both the full and the mini-database runs, if you don't mind.

  2. Your software version numbers (for Python and Anaconda especially, but package versions may help also - pandas & sqlite3 especially). This will help me recreate your environment and hopefully figure out what is going on.
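One way to gather the version numbers in a single step is sketched below. It is just an illustration (the function name environment_report is made up); it reports the SQLite library version via sqlite3.sqlite_version, which is distinct from the version constant of the Python sqlite3 module itself, and it does not cover the Anaconda version.

```python
import platform
import sqlite3
import sys


def environment_report():
    """Collect interpreter, OS, and library versions as a dict of strings."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Version of the SQLite C library that this Python is linked against.
        "sqlite_library": sqlite3.sqlite_version,
    }
    try:
        import pandas
        report["pandas"] = pandas.__version__
    except ImportError:
        report["pandas"] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version}")
```

Running this inside the same conda environment used for jx_indexer.py prints one line per component.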

Thanks very much for your patience here!

Julianne

linagapa commented 4 years ago

Hi Julianne, Thank you for the explanation. I checked again specifically for Esophageal_Carcinoma but this one still contains an empty output:

INFO:root:collecting all jxs for Esophageal_Carcinoma:
INFO:root:0 samples
INFO:root:no samples, continuing

About your questions:

  1. Exact commands for index: python3.6 junction_database/jx_indexer.py -d /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/junction_database index -c /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/GTEX_JUNCTION_COVERAGE.tsv -C /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/TCGA_JUNCTION_COVERAGE.tsv -b /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/GTEX_JUNCTION_BED.bed -B /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/TCGA_JUNCTION_BED.bed -p /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/GTEX_PHEN.tsv -P /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/TCGA_PHEN.tsv -s /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/RECOUNT_SAMPLE_IDS.tsv -g /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/GENCODE_GTF.gtf

Exact command for experiment: python3.6 junction_database/jx_indexer.py -d /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/junction_database experiment -o /mnt/data/RSC/rsc_backup/paez/shared-cancer-splicing/junction_files

  2. I created a conda environment for running these jobs with the following software versions: Python 3.6, Anaconda 4.8.3, sqlite3 2.6.0, pandas 1.0.1

I hope this information can be useful to find out what is the problem. Thanks very much to you for your time and help!

Lina.

JulianneDavid commented 4 years ago

Hi Lina, Sorry for the extended delay here. I'm not sure what to make of these results yet, or what the solution might be, although I'm going to keep working on this. Another quick question - are you running this on a laptop or local computer, or on a cluster system? Have you tried the mini database on multiple setups or only one?

Thanks, and thanks for your patience!

Julianne

linagapa commented 4 years ago

Hi Julianne, No worries, I appreciate the time you are spending on helping me solve this problem. I'm running this on a linux-based workstation to which I have ssh access. I haven't yet tried running the mini database on other setups, so I'll give it a try directly on my laptop (MacBook) and let you know if we make any progress on this front. Thanks a lot! Lina.

JulianneDavid commented 4 years ago

Hi Lina, I wanted to check back in with you on this since it's been a while. I'm happy to continue to help you troubleshoot if this is still something you're working on/having trouble with, although I do plan to close this issue next week otherwise.

In the meantime, I've had other people test the mini-db on different setups and so far have not been able to reproduce your problem. Have you been able to try on your laptop as well as the linux workstation yet?

Julianne

TiongSun commented 3 years ago

Hi Julianne,

Thanks for the scripts and article. I am having a similar issue. After running the "index" code, it generated a 303Gb index file, but the "experiment" code returns directories with empty files. I also ran the mock files and got the same results.

The log:

INFO:root:starting count_samples function
INFO:root:count samples function complete
INFO:root:starting count_samples function
INFO:root:count samples function complete
INFO:root:starting collect all jxs function
INFO:root:collecting all jxs for Uterine_Corpus_Endometrial_Carcinoma:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Paraganglioma:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Esophageal_Carcinoma:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Glioblastoma_Multiforme:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Ovarian_Serous_Cystadenocarcinoma:
INFO:root:0 samples
INFO:root:no samples, continuing
INFO:root:collecting all jxs for Malignant_Peripheral_Nerve_Sheath_Tumors:
INFO:root:0 samples

The command used: python C:\Users\protengineering\Desktop\variants\shared-cancer-splicing-master\shared-cancer-splicing-master\junction_database\jx_indexer.py -d G:\variants\mock_files index -c G:\variants\mock_files\SRP012682.junction_coverage_minimock.tsv -C G:\variants\mock_files\TCGA.junction_coverage_minimock.tsv -b G:\variants\mock_files\SRP012682.junction_id_with_transcripts_minimock.bed -B G:\variants\mock_files\TCGA.junction_id_with_transcripts_minimock.bed -p G:\variants\mock_files\SRP012682_minimock.tsv -P G:\variants\mock_files\TCGA_minimock.tsv -s G:\variants\mock_files\sample_ids_minimock.tsv -g G:\variants\gencode.v28.annotation.gtf

python C:\Users\protengineering\Desktop\variants\shared-cancer-splicing-master\shared-cancer-splicing-master\junction_database\jx_indexer.py -d G:\variants\mock_files experiment -o G:\variants\mock_files\output

I have uploaded the output txt files. The rest of the folders are empty. Have you figured out what the issue might be?

Thank you.

PS. If possible, can you upload some output files/log of a successful run so that we know what to expect. Thanks!

query_log_12-23-2020_13.40.34.txt
tcga_total_neojx_counts_12-23-2020_13.40.34_non_GTEx.txt
tcga_total_neojx_counts_12-23-2020_13.40.34_non_paired_normal.txt
Skin_Cutaneous_Melanoma_all_jxs_normal_NOcov_NOann_filter_12-23-2020_13.40.34.txt
Skin_all_jxs_normal_NOcov_NOann_filter_12-23-2020_13.40.34.txt
Skin_Cutaneous_Melanoma_all_jxs_tumor_NOcov_NOann_filter_12-23-2020_13.40.34.txt

JulianneDavid commented 3 years ago

Hi TiongSun, Thank you very much for reaching out. I'm sorry you are having this issue as well, and I apologize for the delay in responding; I have been on break. Unfortunately I did not figure out what the problem was with linagapa's run, so I don't have a ready solution here; we were not able to reproduce the problem, and so couldn't solve it. It looks like you are having the same issue, where index mode works, but experiment mode does not. Could you send me your environment setup information so that I can try to reproduce what's happening? I'd like to get this solved!

Thanks so much, Julianne

TiongSun commented 3 years ago

Thank you Julianne for your reply. I run using Anaconda. The environment file is attached. Thank you and look forward to hearing back from you.

req.txt

JulianneDavid commented 3 years ago

Hi TiongSun, I just wanted to give you an update; I don't have immediate access to a windows machine, so I'm working on where/how to set up an environment based on your file. I'm sorry for the delay, again.

Julianne

JulianneDavid commented 3 years ago

Hi TiongSun, We're beginning to think this may be an operating system issue; have you only run on windows, or have you tried the mock database on linux or unix? If not, would you be willing to, for instance via cygwin?

Julianne

TiongSun commented 3 years ago

Hi Julianne,

Apologies for late reply. Yes. I will give it a go on linux and keep you posted.

Thanks, TS

kockan commented 1 year ago

Hi @JulianneDavid were there any updates on this issue? I'm also having the same "empty output directory" problem described here and I'm working on a linux system.

My machine configuration is:

Linux 5.19.0-1030-gcp #32~22.04.1-Ubuntu SMP Thu Jul 13 09:36:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

and my full package listing is:

Package                Version
---------------------- ----------------
attrs                  21.2.0
Automat                20.2.0
Babel                  2.8.0
bcrypt                 3.2.0
blinker                1.4
certifi                2020.6.20
chardet                4.0.0
click                  8.0.3
cloud-init             23.2.1
colorama               0.4.4
command-not-found      0.3
configobj              5.0.6
constantly             15.1.0
contourpy              1.1.0
cryptography           3.4.8
cycler                 0.11.0
dbus-python            1.2.18
distro                 1.7.0
distro-info            1.1build1
fonttools              4.41.1
httplib2               0.20.2
hyperlink              21.0.0
idna                   3.3
importlib-metadata     4.6.4
incremental            21.3.0
intervaltree           3.1.0
jeepney                0.7.1
Jinja2                 3.0.3
jsonpatch              1.32
jsonpointer            2.0
jsonschema             3.2.0
keyring                23.5.0
kiwisolver             1.4.4
launchpadlib           1.10.16
lazr.restfulclient     0.14.4
lazr.uri               1.0.6
MarkupSafe             2.0.1
matplotlib             3.7.2
mmh3                   4.0.1
more-itertools         8.10.0
netifaces              0.11.0
numpy                  1.25.2
oauthlib               3.2.0
packaging              23.1
pandas                 2.0.3
pexpect                4.8.0
Pillow                 10.0.0
pip                    22.0.2
ptyprocess             0.7.0
pyasn1                 0.4.8
pyasn1-modules         0.2.1
PyGObject              3.42.1
PyHamcrest             2.0.2
PyJWT                  2.3.0
pyOpenSSL              21.0.0
pyparsing              2.4.7
pyrsistent             0.18.1
pyserial               3.5
python-apt             2.4.0+ubuntu1
python-dateutil        2.8.2
python-debian          0.1.43+ubuntu1.1
python-magic           0.4.24
pytz                   2022.1
PyYAML                 5.4.1
requests               2.25.1
scipy                  1.11.1
seaborn                0.12.2
SecretStorage          3.3.1
service-identity       18.1.0
setuptools             59.6.0
six                    1.16.0
sortedcontainers       2.4.0
sos                    4.4
ssh-import-id          5.11
systemd-python         234
Twisted                22.1.0
tzdata                 2023.3
ubuntu-advantage-tools 8001
ufw                    0.36.1
unattended-upgrades    0.1
urllib3                1.26.5
wadllib                1.3.6
wheel                  0.37.1
zipp                   1.0.0
zope.interface         5.4.0

kockan commented 12 months ago

Fixed by #6