fhcrc / taxtastic

Create and maintain phylogenetic "reference packages" of biological sequences.
GNU General Public License v3.0
21 stars 10 forks source link

update_nodes not incorporating is_valid #134

Closed dhoogest closed 4 years ago

dhoogest commented 4 years ago

For instance the following yaml fails to upsert node UW1127:

---
is_valid: false
names:
- is_classified: true
  is_primary: true
  name_class: synonym
  source_name: uw
  tax_id: UW1127
  tax_name: Lactobacillus rhamnosus or Lactobacillus casei
parent_id: '655183'
rank: species
source_name: uw
tax_id: UW1127
type: node
dhoogest commented 4 years ago

@crosenth this solves the problem for me, but may create other issues?

https://github.com/fhcrc/taxtastic/commit/fb3ca61caa0930d7e4b8c0b0e51c35cbb55ab247

crosenth commented 4 years ago

It might. What is the best way to reproduce this problem?

dhoogest commented 4 years ago

For instance with the above block in tmp.txt (attached), run

taxit -v add_nodes {url} tmp.yml

The node will be added with None in the is_valid column (given the function defaults). Adding the is_valid=is_valid, line to the appended statements properly reflects the is_valid value from the yml block for the node.

tmp.txt

crosenth commented 4 years ago

Dan, without any updates to the code what happens if you set is_valid: true in the tmp.txt yaml file?

For me:

is_valid: true

sqlite3 ncbi_taxonomy.db 'select * from nodes where tax_id = "UW1127"'
|--------+-----------+---------+-----------+-------------+-----------+----------|
| tax_id | parent_id | rank    | embl_code | division_id | source_id | is_valid |
|--------+-----------+---------+-----------+-------------+-----------+----------|
| UW1127 | 655183    | species |           |             | 2         | 1        |
|--------+-----------+---------+-----------+-------------+-----------+----------|

is_valid: false

sqlite3 ncbi_taxonomy.db 'select * from nodes where tax_id = "UW1127"'
|--------+-----------+---------+-----------+-------------+-----------+----------|
| tax_id | parent_id | rank    | embl_code | division_id | source_id | is_valid |
|--------+-----------+---------+-----------+-------------+-----------+----------|
| UW1127 | 655183    | species |           |             | 2         | 0        |
|--------+-----------+---------+-----------+-------------+-----------+----------|
dhoogest commented 4 years ago

For me (no code updates), with either a sqlite or postgres db target and either is_valid=true or is_valid=false, the column isn't populated (remains Null).

dhoogest commented 4 years ago

Excerpt from the taxit log output:

% taxit -vvv add_nodes sqlite:///ncbi_taxonomy.db  tmp.txt --source-name uw

...
INFO log 110 BEGIN (implicit)
2020-08-12 09:55:55,744 INFO sqlalchemy.engine.base.Engine INSERT INTO nodes (tax_id, parent_id, rank, source_id) VALUES (?, ?, ?, ?)
INFO log 110 INSERT INTO nodes (tax_id, parent_id, rank, source_id) VALUES (?, ?, ?, ?)
2020-08-12 09:55:55,744 INFO sqlalchemy.engine.base.Engine ('UW1127', '655183', 'species', 2)
INFO log 110 ('UW1127', '655183', 'species', 2)
2020-08-12 09:55:55,749 INFO sqlalchemy.engine.base.Engine UPDATE names SET is_primary=? WHERE names.tax_id = ?
INFO log 110 UPDATE names SET is_primary=? WHERE names.tax_id = ?
2020-08-12 09:55:55,749 INFO sqlalchemy.engine.base.Engine (0, 'UW1127')
INFO log 110 (0, 'UW1127')
2020-08-12 09:55:55,782 INFO sqlalchemy.engine.base.Engine INSERT INTO names (tax_id, tax_name, name_class, source_id, is_primary, is_classified) VALUES (?, ?, ?, ?, ?, ?)
INFO log 110 INSERT INTO names (tax_id, tax_name, name_class, source_id, is_primary, is_classified) VALUES (?, ?, ?, ?, ?, ?)
2020-08-12 09:55:55,782 INFO sqlalchemy.engine.base.Engine ('UW1127', 'Lactobacillus rhamnosus or Lactobacillus casei', 'synonym', 2, 1, 1)
INFO log 110 ('UW1127', 'Lactobacillus rhamnosus or Lactobacillus casei', 'synonym', 2, 1, 1)
2020-08-12 09:55:55,797 INFO sqlalchemy.engine.base.Engine COMMIT
INFO log 110 COMMIT

So the INSERT query on nodes isn't passing is_valid. This is with:

% taxit --version
taxit v0.9.0
dhoogest commented 4 years ago

With the change proposed in fb3ca61:

INFO log 110 (2,)
2020-08-12 10:07:06,225 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
INFO log 110 BEGIN (implicit)
2020-08-12 10:07:06,225 INFO sqlalchemy.engine.base.Engine INSERT INTO nodes (tax_id, parent_id, rank, source_id, is_valid) VALUES (?, ?, ?, ?, ?)
INFO log 110 INSERT INTO nodes (tax_id, parent_id, rank, source_id, is_valid) VALUES (?, ?, ?, ?, ?)
2020-08-12 10:07:06,225 INFO sqlalchemy.engine.base.Engine ('UW1127', '655183', 'species', 2, 0)
INFO log 110 ('UW1127', '655183', 'species', 2, 0)
2020-08-12 10:07:06,228 INFO sqlalchemy.engine.base.Engine UPDATE names SET is_primary=? WHERE names.tax_id = ?
INFO log 110 UPDATE names SET is_primary=? WHERE names.tax_id = ?
2020-08-12 10:07:06,229 INFO sqlalchemy.engine.base.Engine (0, 'UW1127')
INFO log 110 (0, 'UW1127')
2020-08-12 10:07:06,230 INFO sqlalchemy.engine.base.Engine INSERT INTO names (tax_id, tax_name, name_class, source_id, is_primary, is_classified) VALUES (?, ?, ?, ?, ?, ?)
INFO log 110 INSERT INTO names (tax_id, tax_name, name_class, source_id, is_primary, is_classified) VALUES (?, ?, ?, ?, ?, ?)
2020-08-12 10:07:06,230 INFO sqlalchemy.engine.base.Engine ('UW1127', 'Lactobacillus rhamnosus or Lactobacillus casei', 'synonym', 2, 1, 1)
INFO log 110 ('UW1127', 'Lactobacillus rhamnosus or Lactobacillus casei', 'synonym', 2, 1, 1)
2020-08-12 10:07:06,233 INFO sqlalchemy.engine.base.Engine COMMIT
INFO log 110 COMMIT

So, is_valid included in the INSERT. The true/false value from the yaml file is reflected properly in the database (tested in both sqlite and postgres)

crosenth commented 4 years ago

Dan you are right, I was going thru the update logic in the code which already has your proposed update. Please create a PR so we can merge

crosenth commented 3 years ago

https://github.com/fhcrc/taxtastic/releases/tag/v0.9.1