KarchinLab / open-cravat

A modular annotation tool for genomic variants
MIT License
113 stars 27 forks source link

filtersqlite #100

Closed spaul-genetics closed 7 months ago

spaul-genetics commented 2 years ago

I was using the filtersqlite as:

oc util filtersqlite main_file.sqlite -f filter.json -s filtered

The main_file.sqlite is 88Mb while the main_file.filtered.sqlite is 232Gb.

What's happening here?

rkimoakbioinformatics commented 2 years ago

That is indeed strange. Can you let me know the result of the following commands?

For main_file,
>sqlite3 main_file.sqlite
sqlite>select count(*) from variant;
sqlite>select count(*) from gene;
sqlite>select count(*) from sample;
sqlite>select count(*) from mapping;

For main_file.filtered.sqlite,
>sqlite3 main_file.filtered.qlite
sqlite>select count(*) from variant;
sqlite>select count(*) from gene;
sqlite>select count(*) from sample;
sqlite>select count(*) from mapping;

If there is any difference between the two count(*) for a table, further investigation can be done with

For main_file,
>sqlite3 main_file.sqlite
sqlite>select * from <table name> limit 50;

For main_file.filtered.sqlite,
sqlite>select * from <table name> limit 50;
spaul-genetics commented 2 years ago

Here are the counts. From the main sqlite file:

sqlite> .open toy_annot.sqlite
sqlite> select count(*) from variant;
711
sqlite> select count(*) from gene;
318
sqlite> select count(*) from sample;
7112
sqlite> select count(*) from mapping;
711

I tried using util filtersqlite and use filter.json file created using oc gui but ended up with sqlite3.OperationalError: no such table: viewersetup. I used oc report toy_annot.sqlite -t vcf -f toy_filter.json to filter into a vcf file and run oc on the filtered vcf.

The counts on this filtered database

sqlite> .open toy_annot_oc_filtered.sqlite
sqlite> select count(*) from variant;
51
sqlite> select count(*) from gene;
41
sqlite> select count(*) from sample;
332
sqlite> select count(*) from mapping;
51

I also used my SQL code to filter variants using same criteria. This is the count I get:

sqlite> .open toy_annot_sql_filtered.sqlite
sqlite> select count(*) from variant;
602
sqlite> select count(*) from gene;
231
sqlite> select count(*) from sample;
5800
sqlite> select count(*) from mapping;
602

I can provide the toy example data that I used to recreate the issue if necessary.

kmoad commented 1 year ago

Sorry for the slow response. If you are still interested in this issue, please send the toy example data.