Sage-Bionetworks / sysbioDCCjsonschemas

SysBio DCC JSON schemas
1 stars 7 forks source link

AD dictionary table update failed #118

Closed avanlinden closed 2 years ago

avanlinden commented 2 years ago

I've successfully used the readme documentation to start a docker container and update the AD dictionary table using the create_syn_table_from_syn_schema.py script before. However, when I ran it yesterday, it deleted the existing table rows but did not successfully store the new table. The error I received is below:

Traceback (most recent call last): File "/sysbioDCCjsonschemas/code/python/create_Syn_table_from_Syn_schemas.py", line 187, in <module> main() File "/sysbioDCCjsonschemas/code/python/create_Syn_table_from_Syn_schemas.py", line 183, in main args.func(args, dccv_syn) File "/sysbioDCCjsonschemas/code/python/create_Syn_table_from_Syn_schemas.py", line 151, in process_overwrite_table table_out = syn.store(Table(dcc_val_table.id, syn_table_df)) File "/usr/local/lib/python3.9/site-packages/synapseclient/table.py", line 1382, in Table return CsvFileTable.from_data_frame(schema, values, **kwargs) File "/usr/local/lib/python3.9/site-packages/synapseclient/table.py", line 1779, in from_data_frame for col in schema.columns_to_store: AttributeError: 'str' object has no attribute 'columns_to_store'

I was able to use the script to generate a completely new table in a different project, and used that table to restore the dictionary table.

avanlinden commented 2 years ago

@danlu1 Appreciate your input!

avanlinden commented 2 years ago

Resolution: Tom Yu said that this is a bug in the latest Synapse client release, and it will be patched in 2.5.1. For now we should use version 2.4.0. The dockerfile does not specify which version of the python client to use, so when I built it locally and ran the container last week it pulled the newest release.


That being said,

@kelshmo Do you know if there's a reason Cindy set this up to delete the existing dictionary table rows before uploading the new rows? My impression was that usually Synapse is good about only updating rows that have changed on a synStore(Table()) operation so I'm not sure if this is needed. I was able to successfully update the AD table by commenting the delete command out.

https://github.com/Sage-Bionetworks/sysbioDCCjsonschemas/blob/bf034300ec7a2bec28defdbf4ea6ffe793e31888/code/python/create_Syn_table_from_Syn_schemas.py#L149

The benefit to not deleting is that if the script fails mid-run, there's no chance of having wiped the existing table (which is what happened to me).

kelshmo commented 2 years ago

I definitely think it is not advisable to delete the table rows. This is a bad practice many of us got in the habit of because managing the table row versions can present challenges:

Synapse tables can handle updating existing rows and appending new rows. If you need the table to be alphabetized, you will need to add some code that alters the index of the ROW_ID because tables can't handle a NA value in the middle of the table.

I suggest testing out all the outcomes: a row is added in the middle of the table, a row is added at the end of the table, a row is removed, a row is altered.

avanlinden commented 2 years ago

Since this table is back-end only for AD it's not alphabetized, and in fact the whole updated table is generated purely from schemas rather than downloading and altering existing rows, so row ids and versions should not be a factor here. But in the interest of maintaining a flexible script I will definitely test some outcomes! Thanks.

danlu1 commented 2 years ago

I also ran into the same issue as you did @avanlinden when I was trying to add a new study name (m6A-seqAutism) to dccvalidtor. But when I checked the dccvalidator, the new study name is in it even though I ran into error when ran the script. Super weird.

avanlinden commented 2 years ago

The current script does not handle row versions/row IDs, so commenting out the "delete table rows" part resulted in many duplicate rows. Since the original bug was related to an issue with the 2.5 version of the synapse python client, I went back to the initial function deleting the existing rows and am running it in a venv with synapseclient 2.4 until the patch comes out. Closing this.