JulianneDavid / shared-cancer-splicing

Code for reproducing analyses and figures for shared alternative cancer splicing paper
MIT License

jxsamp2 and jxsampleindex operational error #4

Closed: Liam730 closed this issue 4 years ago

Liam730 commented 4 years ago

Dear Julianne,

After around 24 hours of running jx_indexer.py in index mode to create new_jx_index.db, the log reported "jxsamp2 operational error" and "jxsampleindex operational error", and the database file was around 123 GB. I then tried to run the script in experiment mode, but it failed with "database or disk is full". I have not changed the max page size of the SQL database, and the project directory on my workstation has 15 TB free. So I would like to know whether the database was successfully constructed in index mode, and if not, whether the error could be caused by a full database. The commands and log are attached below.

Thanks in advance!

Best,

Liam

Running jx_indexer.py in index mode.

python3 ../junction_database/jx_indexer.py -d ./ index -c GTEX_JUNCTION_COVERAGE.tsv -C TCGA_JUNCTION_COVERAGE.tsv -b GTEX_JUNCTION_BED.bed -B TCGA_JUNCTION_BED.bed -p GTEX_PHEN.tsv -P TCGA_PHEN.tsv -s RECOUNT_SAMPLE_IDS.tsv -g GENCODE_GTF_V28.gtf 

The database file already exists. Would you like to erase the current database file and begin generating the database from scratch? [y/n]
(Note: the full database is 249GB and takes significant resources to generate. Deleting the database file is only recommended if a previous indexing run did not complete successfully.
Would you like to delete the database file? [y/n]: y
Deleting existing database file.
/projects/shared-cancer-splicing/junction_database/index.py:447: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  gtex_phen = pd.read_table(tissue_phen, usecols=['sampid', 'smts', 'run', 'auc'])
/projects/shared-cancer-splicing/junction_database/index.py:467: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  'gdc_cases.diagnoses.tumor_stage': str
/projects/shared-cancer-splicing/junction_database/index.py:488: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  names=['recount_id', 'universal_id']
/projects/shared-cancer-splicing/junction_database/index.py:263: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  names=['recount_id', 'universal_id']
/projects/shared-cancer-splicing/junction_database/index.py:290: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  converters=uppercase_and_spaces, dtype=str
/projects/shared-cancer-splicing/junction_database/index.py:290: ParserWarning: Both a converter and dtype were specified for column gdc_file_id - only the converter will be used
  converters=uppercase_and_spaces, dtype=str
/projects/shared-cancer-splicing/junction_database/index.py:290: ParserWarning: Both a converter and dtype were specified for column gdc_cases.project.name - only the converter will be used
  converters=uppercase_and_spaces, dtype=str
/projects/shared-cancer-splicing/junction_database/index.py:371: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  tissue_phens, usecols=['sampid', 'smts', 'run', 'smtsd']
/projects/shared-cancer-splicing/junction_database/index.py:390: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  full_df = pd.concat([tcga_phen, gtex_phen], ignore_index=True).fillna('')
phenotype table creating complete, moving to indexing
coding regions discovered
CDS tree created
splice sites extracted
starting tcga junctions

0th entry, writing
intermediate fill time is 0.03366231918334961
total fill time is 0.039887428283691406

1000000th entry, writing
intermediate fill time is 1340.480161190033
total fill time is 1340.5222470760345

2000000th entry, writing
intermediate fill time is 1805.2847962379456
total fill time is 3145.830829143524

3000000th entry, writing
intermediate fill time is 1510.7211990356445
total fill time is 4656.682926654816

4000000th entry, writing
intermediate fill time is 1279.6701357364655
total fill time is 5936.366735935211

5000000th entry, writing
intermediate fill time is 1155.0313727855682
total fill time is 7091.40999007225

6000000th entry, writing
intermediate fill time is 1103.9082434177399
total fill time is 8195.34603524208

7000000th entry, writing
intermediate fill time is 1217.2894582748413
total fill time is 9412.691999197006

8000000th entry, writing
intermediate fill time is 1188.807175397873
total fill time is 10601.508999109268

9000000th entry, writing
intermediate fill time is 1483.8623654842377
total fill time is 12085.383065223694

10000000th entry, writing
intermediate fill time is 1095.3373186588287
total fill time is 13180.777994155884

11000000th entry, writing
intermediate fill time is 900.8276903629303
total fill time is 14081.628247261047

12000000th entry, writing
intermediate fill time is 1144.8789982795715
total fill time is 15226.771543979645

13000000th entry, writing
intermediate fill time is 1037.0481944084167
total fill time is 16263.887160778046

14000000th entry, writing
intermediate fill time is 1128.4980201721191
total fill time is 17392.45245361328

15000000th entry, writing
intermediate fill time is 1160.605678319931
total fill time is 18553.111570835114

16000000th entry, writing
intermediate fill time is 1420.1160264015198
total fill time is 19973.274503707886

17000000th entry, writing
intermediate fill time is 954.2554264068604
total fill time is 20927.70289373398

18000000th entry, writing
intermediate fill time is 1010.449805021286
total fill time is 21938.183844327927

19000000th entry, writing
intermediate fill time is 1105.0607590675354
total fill time is 23043.253116846085

20000000th entry, writing
intermediate fill time is 1154.479502439499
total fill time is 24198.024032354355

21000000th entry, writing
intermediate fill time is 1354.1132469177246
total fill time is 25552.3160841465

22000000th entry, writing
intermediate fill time is 1353.6109869480133
total fill time is 26905.935834884644

23000000th entry, writing
intermediate fill time is 1065.984040737152
total fill time is 27971.934435606003

24000000th entry, writing
intermediate fill time is 850.9171245098114
total fill time is 28822.911124944687

25000000th entry, writing
intermediate fill time is 1060.030461549759
total fill time is 29883.172817230225

26000000th entry, writing
intermediate fill time is 1308.599583864212
total fill time is 31191.85195827484

27000000th entry, writing
intermediate fill time is 1069.3831038475037
total fill time is 32261.257700443268

28000000th entry, writing
intermediate fill time is 1197.217345237732
total fill time is 33458.72323489189

29000000th entry, writing
intermediate fill time is 1140.2288234233856
total fill time is 34599.12787818909

30000000th entry, writing
intermediate fill time is 1343.7761099338531
total fill time is 35942.91222047806

31000000th entry, writing
intermediate fill time is 1120.942459344864
total fill time is 37063.86402320862

32000000th entry, writing
intermediate fill time is 1033.0065429210663
total fill time is 38096.884263038635

33000000th entry, writing
intermediate fill time is 1073.92631149292
total fill time is 39170.82423400879

34000000th entry, writing
intermediate fill time is 941.5769293308258
total fill time is 40112.41028499603

35000000th entry, writing
intermediate fill time is 1102.69549202919
total fill time is 41215.13423418999

36000000th entry, writing
intermediate fill time is 772.1114857196808
total fill time is 41987.473977804184
starting tcga junctions

0th entry, writing
intermediate fill time is 0.027585983276367188
total fill time is 42834.209646463394

1000000th entry, writing
intermediate fill time is 1239.1204273700714
total fill time is 44073.51569414139

2000000th entry, writing
intermediate fill time is 1433.2405326366425
total fill time is 45506.76509928703

3000000th entry, writing
intermediate fill time is 1384.9402496814728
total fill time is 46891.71401834488

4000000th entry, writing
intermediate fill time is 1053.6094753742218
total fill time is 47945.33536672592

5000000th entry, writing
intermediate fill time is 1164.3312830924988
total fill time is 49109.76820278168

6000000th entry, writing
intermediate fill time is 1220.179963350296
total fill time is 50330.178143024445

7000000th entry, writing
intermediate fill time is 1174.4764006137848
total fill time is 51504.68827056885

8000000th entry, writing
intermediate fill time is 1121.0915036201477
total fill time is 52625.91539931297

9000000th entry, writing
intermediate fill time is 999.5221936702728
total fill time is 53625.55318188667

10000000th entry, writing
intermediate fill time is 1066.4986679553986
total fill time is 54692.36959886551

11000000th entry, writing
intermediate fill time is 1116.4537031650543
total fill time is 55809.0054833889

12000000th entry, writing
intermediate fill time is 1203.261973142624
total fill time is 57012.454063653946

13000000th entry, writing
intermediate fill time is 1356.5673713684082
total fill time is 58369.21350502968

14000000th entry, writing
intermediate fill time is 983.9775261878967
total fill time is 59353.426255226135

15000000th entry, writing
intermediate fill time is 1020.8703470230103
total fill time is 60374.370203733444

16000000th entry, writing
intermediate fill time is 1160.5760707855225
total fill time is 61535.01049041748

17000000th entry, writing
intermediate fill time is 1292.3074777126312
total fill time is 62827.565974235535

18000000th entry, writing
intermediate fill time is 955.2640810012817
total fill time is 63783.05248236656

19000000th entry, writing
intermediate fill time is 930.2662501335144
total fill time is 64713.75727057457

20000000th entry, writing
intermediate fill time is 1389.0327684879303
total fill time is 66103.05267882347

21000000th entry, writing
intermediate fill time is 1214.8944170475006
total fill time is 67317.95460963249

22000000th entry, writing
intermediate fill time is 1040.630005121231
total fill time is 68358.63213634491

23000000th entry, writing
intermediate fill time is 1067.6009871959686
total fill time is 69426.34779310226

24000000th entry, writing
intermediate fill time is 1324.7294971942902
total fill time is 70751.09366416931

25000000th entry, writing
intermediate fill time is 926.5962507724762
total fill time is 71677.7187743187

26000000th entry, writing
intermediate fill time is 1030.448053598404
total fill time is 72708.21226143837

27000000th entry, writing
intermediate fill time is 997.5799708366394
total fill time is 73705.9893951416

28000000th entry, writing
intermediate fill time is 1089.6713728904724
total fill time is 74795.70169258118

29000000th entry, writing
intermediate fill time is 875.767471075058
total fill time is 75671.52742695808
all junctions added to db!  adding db indexes.

first index done
intermediate time is 29.941766500473022
total time is 75838.0406627655

second index done
intermediate time is 42.09428071975708
total time is 75880.13519239426
jxsampleindex operational error

third index done
intermediate time is 104.85167646408081
total time is 75984.98719644547
jxsamp2 operational error

fourth index done
intermediate time is 85.49750280380249
FINAL total time is 76070.48501801491

Running jx_indexer.py in experiment mode.

python3 ../junction_database/jx_indexer.py -d ./ experiment -o ./JC
Traceback (most recent call last):
  File "/bin/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 1431, in execute
    cur.execute(*args)
sqlite3.OperationalError: database or disk is full

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../junction_database/jx_indexer.py", line 73, in <module>
    submain.main(args, now, conn, index_db)
  File "/projects/shared-cancer-splicing/junction_database/experiment.py", line 658, in main
    collect_data_for_analyses(batch_num, out_path, now, conn, index_db)
  File "/projects/shared-cancer-splicing/junction_database/experiment.py", line 590, in collect_data_for_analyses
    collect_all_jxs(batch_num, all_jxs_dir, now, conn)
  File "/projects/shared-cancer-splicing/junction_database/experiment.py", line 124, in collect_all_jxs
    query_result = pd.read_sql_query(select_command, db_conn)
  File "/bin/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 314, in read_sql_query
    parse_dates=parse_dates, chunksize=chunksize)
  File "/bin/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 1468, in read_query
    cursor = self.execute(*args)
  File "/bin/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 1445, in execute
    raise_with_traceback(ex)
  File "/bin/anaconda3/lib/python3.7/site-packages/pandas/compat/__init__.py", line 420, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "/bin/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 1431, in execute
    cur.execute(*args)
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT jx_annotation_map.jx, jx_annotation_map.annotation, COUNT (phen_recount) FROM (SELECT jx_sample_map.jx_id can_jxs, phen_recount FROM (SELECT recount_id phen_recount FROM sample_phenotype_map WHERE sample_phenotype_map.project_type_label == "Esophageal_Carcinoma" AND sample_phenotype_map.tumor_normal == 0) INNER JOIN jx_sample_map ON phen_recount == jx_sample_map.recount_id) INNER JOIN jx_annotation_map ON jx_annotation_map.jx_id==can_jxs GROUP BY (jx_annotation_map.jx_id);': database or disk is full
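One quick way to check whether index mode finished populating the tables before the indexing step failed is to count rows in each table. This is a hedged sketch: the table names are taken from the failing query above, and it assumes new_jx_index.db is in the working directory.

```python
import sqlite3

# Count rows in the core tables named in the failing query above.
# A table that raises OperationalError was never created; a table
# with zero rows was created but never populated.
conn = sqlite3.connect("new_jx_index.db")
counts = {}
for table in ("sample_phenotype_map", "jx_sample_map", "jx_annotation_map"):
    try:
        counts[table] = conn.execute(
            "SELECT COUNT(*) FROM {}".format(table)
        ).fetchone()[0]
    except sqlite3.OperationalError:
        counts[table] = None  # table missing
print(counts)
conn.close()
```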
JulianneDavid commented 4 years ago

Hi Liam,

It looks like your database was fully constructed and populated in index mode, given the log and the db size, but the indexing was not fully executed. (The final size of the full and indexed database is ~260 GB.) The db can be used in this state, but completing the indexing makes the experiment mode queries much faster afterwards.

I set this up to pass over that error since the indexing isn't critical to forming the database, but I think it's likely that the failed indexing's operational error is a "disk or database full" error similar to the one you got in experiment mode. This is most likely caused by the directory used for temporary file storage running out of space. If that is the issue, I should be able to solve it by adding a command line option for you to specify the desired temp file directory (https://sqlite.org/c3ref/temp_directory.html). The list of places SQLite looks for temp file storage is here: https://sqlite.org/tempfiles.html#tempdir

How much space do you have in /var/tmp, /usr/tmp, or /tmp (if available, in that order)?
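Redirecting SQLite's temp files to a roomier directory requires no code change: the SQLITE_TMPDIR environment variable is checked first when SQLite chooses a temp-file location. A minimal sketch, where /path/with/space and new_jx_index.db are placeholders rather than paths from this thread:

```python
import os
import sqlite3

# SQLITE_TMPDIR must be set before the connection is opened;
# SQLite consults it first when picking a temp-file directory
# (and silently falls back if the path is not writable).
os.environ["SQLITE_TMPDIR"] = "/path/with/space"

conn = sqlite3.connect("new_jx_index.db")
# The deprecated but still-honored PRAGMA is an alternative:
# conn.execute("PRAGMA temp_store_directory = '/path/with/space'")
conn.execute("CREATE TEMP TABLE scratch (x INTEGER)")
conn.close()
```

Exporting `SQLITE_TMPDIR` in the shell before launching jx_indexer.py would have the same effect without editing the script.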

For now, don't re-run index mode, since your database is completely populated and only the indexing remains. Once we confirm that the temp file directory is the issue, I can add an "index the created database" mode, or you could execute the indexing in an interactive sqlite3 session.
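Resuming only the indexing step on the already-populated database could look roughly like the following. This is a sketch only: the index and column names are assumptions inferred from the failing query and should be checked against the CREATE INDEX statements in index.py before running.

```python
import sqlite3

conn = sqlite3.connect("new_jx_index.db")
# Stand-in table so the sketch runs on an empty file; on the real
# populated database the table already exists, so IF NOT EXISTS
# makes this statement a no-op. Column names are assumptions.
conn.execute(
    "CREATE TABLE IF NOT EXISTS jx_sample_map (jx_id INTEGER, recount_id TEXT)"
)
# Rebuild the missing indexes; names here are assumptions as well.
conn.execute(
    "CREATE INDEX IF NOT EXISTS samp_id_index ON jx_sample_map (recount_id)"
)
conn.execute(
    "CREATE INDEX IF NOT EXISTS jx_samp_id_index ON jx_sample_map (jx_id, recount_id)"
)
conn.commit()
conn.close()
```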

Julianne

Liam730 commented 4 years ago

Hi Julianne,

Thanks for your reply!

I checked the tmp dir: 4.4 GB out of 20 GB was available. How much space does the indexing process need in the tmp dir?

To test whether tmp space was the cause, I transferred new_jx_index.db to a server with ~20 GB free in its tmp dir and tried to create the samp_id_index and jx_samp_id_index indices on the jx_sample_map table in an interactive sqlite3 session, referring to your index.py code. To my surprise, these indices were created after ~3.5 hours without any error! The database came to 248 GB, matching the expected size. Next, I ran jx_indexer.py in experiment mode; a few hours have passed and no error has occurred so far. Then I performed the same manual index creation on my workstation, and the Error: database or disk is full occurred again. So it was indeed insufficient space in my workstation's tmp dir that caused the error. I will go ahead with the following steps. Thanks a lot for your help!

Best,

Liam

JulianneDavid commented 4 years ago

Hi Liam, Thanks for the update - I'm glad that was indeed the problem. I will add an optional command line parameter to set the temp directory for future users, and close this issue.

Julianne