Difference in how annotations are retrieved

allaway commented 3 years ago

Describe the bug

As noted in #456, @BrunoGrandePhD and @sujaypatil96 get different values than I do when retrieving annotations from Synapse. I get True/False, while they get TRUE/FALSE. I also noticed on my most recent try that numbers stored as Text annotations are converted to doubles. Example here: https://docs.google.com/spreadsheets/d/1WmkQU76LVORynmSC56tGfZOKXD21Uu-2ZoxAP_ZGXKE

To Reproduce

This manifest can be generated by following the instructions I described in #456.

Expected behavior

I expect the value to be retrieved as TRUE, rather than True. In other words, I expect Sujay's/Bruno's manifest, but I get mine.

Additional context

I tried this on both the shiny server staging conda env I have as well as local, and I get the same results both places.
All of the annotations are of type "Text" - they are not true booleans, integers, etc. The dataset in question is very old, relatively speaking - 2016 or so - and I think it may have been at a point where we didn't have types for annotations, but I'm not confident about this.

As requested, here's the output of pip freeze:

(data_curator_env) ➜  schematic-upstream git:(develop) ✗ pip freeze
appdirs==1.4.4
attrs==20.3.0
CacheControl==0.12.6
cachetools==4.2.1
cachy==0.3.0
certifi==2020.12.5
chardet==4.0.0
cleo==0.8.1
click==7.1.2
click-log==0.3.2
clikit==0.6.2
crashtest==0.3.1
decorator==4.4.2
Deprecated==1.2.12
distlib==0.3.1
entrypoints==0.3
filelock==3.0.12
google-api-core==1.26.3
google-api-python-client==1.12.8
google-auth==1.28.1
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.4
googleapis-common-protos==1.53.0
graphviz==0.16
html5lib==1.1
httplib2==0.19.1
idna==2.10
inflection==0.5.1
isodate==0.6.0
jsonschema==3.2.0
keyring==12.0.2
lockfile==0.12.2
msgpack==1.0.2
networkx==2.5.1
numpy==1.20.2
oauth2client==3.0.0
oauthlib==3.1.0
packaging==20.9
pandas==1.2.4
pastel==0.2.1
pexpect==4.8.0
pkginfo==1.7.0
poetry==1.1.5
poetry-core==1.0.3
protobuf==3.15.8
ptyprocess==0.7.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pygsheets==2.0.5
pylev==1.3.0
pyparsing==2.4.7
pyrsistent==0.17.3
python-dateutil==2.8.1
pytz==2021.1
PyYAML==5.4.1
rdflib==5.0.0
requests==2.25.1
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rsa==4.7.2
schematicpy @ git+https://github.com/Sage-Bionetworks/schematic.git@055c05819007cf5eaad199a2143bf200d3f86934
shellingham==1.4.0
six==1.15.0
synapseclient==2.3.0
toml==0.10.2
tomlkit==0.7.0
uritemplate==3.0.1
urllib3==1.26.4
virtualenv==20.4.3
webencodings==0.5.1
wrapt==1.12.1

Here's the output of conda list:


(data_curator_env) ➜  schematic-upstream git:(develop) ✗ conda list
# packages in environment at /opt/miniconda3/envs/data_curator_env:
#
# Name                    Version                   Build  Channel
appdirs                   1.4.4                    pypi_0    pypi
attrs                     20.3.0                   pypi_0    pypi
ca-certificates           2021.1.19            hecd8cb5_1  
cachecontrol              0.12.6                   pypi_0    pypi
cachetools                4.2.1                    pypi_0    pypi
cachy                     0.3.0                    pypi_0    pypi
certifi                   2020.12.5        py39hecd8cb5_0  
chardet                   4.0.0                    pypi_0    pypi
cleo                      0.8.1                    pypi_0    pypi
click                     7.1.2                    pypi_0    pypi
click-log                 0.3.2                    pypi_0    pypi
clikit                    0.6.2                    pypi_0    pypi
crashtest                 0.3.1                    pypi_0    pypi
decorator                 4.4.2                    pypi_0    pypi
deprecated                1.2.12                   pypi_0    pypi
distlib                   0.3.1                    pypi_0    pypi
entrypoints               0.3                      pypi_0    pypi
filelock                  3.0.12                   pypi_0    pypi
google-api-core           1.26.3                   pypi_0    pypi
google-api-python-client  1.12.8                   pypi_0    pypi
google-auth               1.28.1                   pypi_0    pypi
google-auth-httplib2      0.0.4                    pypi_0    pypi
google-auth-oauthlib      0.4.4                    pypi_0    pypi
googleapis-common-protos  1.53.0                   pypi_0    pypi
html5lib                  1.1                      pypi_0    pypi
httplib2                  0.19.1                   pypi_0    pypi
idna                      2.10                     pypi_0    pypi
inflection                0.5.1                    pypi_0    pypi
isodate                   0.6.0                    pypi_0    pypi
jsonschema                3.2.0                    pypi_0    pypi
keyring                   12.0.2                   pypi_0    pypi
libcxx                    10.0.0                        1  
libffi                    3.3                  hb1e8313_2  
lockfile                  0.12.2                   pypi_0    pypi
msgpack                   1.0.2                    pypi_0    pypi
ncurses                   6.2                  h0a44026_1  
networkx                  2.5.1                    pypi_0    pypi
numpy                     1.20.2                   pypi_0    pypi
oauth2client              3.0.0                    pypi_0    pypi
oauthlib                  3.1.0                    pypi_0    pypi
openssl                   1.1.1k               h9ed2024_0  
packaging                 20.9                     pypi_0    pypi
pandas                    1.2.4                    pypi_0    pypi
pastel                    0.2.1                    pypi_0    pypi
pexpect                   4.8.0                    pypi_0    pypi
pip                       21.0.1           py39hecd8cb5_0  
pkginfo                   1.7.0                    pypi_0    pypi
poetry                    1.1.5                    pypi_0    pypi
poetry-core               1.0.3                    pypi_0    pypi
protobuf                  3.15.8                   pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pygsheets                 2.0.5                    pypi_0    pypi
pylev                     1.3.0                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
pyrsistent                0.17.3                   pypi_0    pypi
python                    3.9.2                h88f2d9e_0  
python-dateutil           2.8.1                    pypi_0    pypi
python-graphviz           0.16                     pypi_0    pypi
pytz                      2021.1                   pypi_0    pypi
pyyaml                    5.4.1                    pypi_0    pypi
rdflib                    5.0.0                    pypi_0    pypi
readline                  8.1                  h9ed2024_0  
requests                  2.25.1                   pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
requests-toolbelt         0.9.1                    pypi_0    pypi
rsa                       4.7.2                    pypi_0    pypi
schematicpy               0.1.11                   pypi_0    pypi
setuptools                52.0.0           py39hecd8cb5_0  
shellingham               1.4.0                    pypi_0    pypi
six                       1.15.0                   pypi_0    pypi
sqlite                    3.35.4               hce871da_0  
synapseclient             2.3.0                    pypi_0    pypi
tk                        8.6.10               hb0a8c7a_0  
toml                      0.10.2                   pypi_0    pypi
tomlkit                   0.7.0                    pypi_0    pypi
tzdata                    2020f                h52ac0ba_0  
uritemplate               3.0.1                    pypi_0    pypi
urllib3                   1.26.4                   pypi_0    pypi
virtualenv                20.4.3                   pypi_0    pypi
webencodings              0.5.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3eb1b0_0  
wrapt                     1.12.1                   pypi_0    pypi
xz                        5.2.5                h1de35cc_0  
zlib                      1.2.11               h1de35cc_3

allaway commented 3 years ago

I tried this with a newer dataset and TRUE/FALSE is as expected:

(data_curator_env) ➜  schematic-upstream git:(develop) ✗ schematic manifest --config config.yml get -t "JHU Biobank RNA seq" -dt GenomicsAssay -d syn23529662 -s --use_annotations --oauth -p ~/Downloads/NF.jsonld
Starting schematic...
The (model > input > validation_schema) argument with value 'data/validation_schemas/example_validation_schema.json' is being read from the config file.
The '--json_schema' argument is being taken from configuration file (model > input > validation_schema), i.e., 'data/validation_schemas/example_validation_schema.json'.
warning: The `use_annotations` option is currently only supported when there is no manifest file for the dataset in question.
 [####################]100.00%   1/1   Done...
Downloading  [####################]100.00%   16.2MB/16.2MB (5.4MB/s) SYNAPSE_TABLE_QUERY_75279483.csv.synapse_download_75279483 Done...
    [WARNING] /opt/miniconda3/envs/data_curator_env/lib/python3.9/site-packages/schematic/manifest/generator.py:877: DtypeWarning: Columns (7,27,34,36,37,38,40,42,44,45,49,50,52) have mixed types.Specify dtype option on import or set low_memory=False.
  syn_store = SynapseStorage()

warning: /opt/miniconda3/envs/data_curator_env/lib/python3.9/site-packages/schematic/manifest/generator.py:877: DtypeWarning: Columns (7,27,34,36,37,38,40,42,44,45,49,50,52) have mixed types.Specify dtype option on import or set low_memory=False.
warning:   syn_store = SynapseStorage()
Using slower (non-batch) sequential mode
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as data/json_schema_logs/json_schema_log.json
Permission Id: anyoneWithLink
Find the manifest template using this Google Sheet URL:
https://docs.google.com/spreadsheets/d/1xfn3WmJKWnmm3Jv0VfqGJLEFmFISOn5oirYZzhACynk

The only difference that I can see, other than the age of the data, is that the run that worked as expected used non-batch mode, whereas the CTFcNFWGS run used the faster batch-based (fileview, I think?) mode to retrieve annotations.

BrunoGrandePhD commented 3 years ago

@allaway: I'll revisit this tomorrow when I review GitHub PRs, but I think you've hit the nail on the head. Thanks for taking the time to investigate!

@sujaypatil96 and I would be forced to use the slower non-batch method because we don't have permission on the Synapse project, whereas you were able to create the file view and use the faster batch method. Also, I wouldn't be surprised if retrieving annotations using the file view differs from using the entity annotations API. We already have a few functions to rectify the differences.

Note to self: Add a new _fix_boolean_columns() function to output TRUE and FALSE consistently, and update the unit tests to cover this edge case.

BrunoGrandePhD commented 3 years ago

I've started a draft PR (#462) to address this issue.

allaway commented 3 years ago

One note: I think Sujay does have permission to create a fileview on the project in question, because Sujay is on this team: https://www.synapse.org/#!Team:3378999 (unless @sujaypatil96 is using a different service account with schematic, which might also be the case). I just wanted to mention that in case my suspicion above is wrong.

BrunoGrandePhD commented 3 years ago

@sujaypatil96: Can you reconcile @allaway's observation with my hypothesis?

sujaypatil96 commented 3 years ago

@BrunoGrandePhD @allaway: I ran the below command from here, and this is what the JHU Biobank RNA seq spreadsheet looks like.

schematic manifest --config config.yml get -t "JHU Biobank RNA seq" -dt GenomicsAssay -d syn23529662 -s --use_annotations --oauth -p data/schema_org_schemas/NF.jsonld

I'm seeing the following log message after execution: Using slower (non-batch) sequential mode.

As for the CTFcNFWGS manifest, below is the command I ran, and here is the manifest.

schematic manifest --config config.yml get -t CTFcNFWGS -dt ImagingAssay -d syn4984626 -s --use_annotations --oauth -p /data/schema_org_schemas/NF.jsonld

Here are the log messages from the above command:

Unable to create a temporary file view bound to syn4984626. Defaulting to slower iterative retrieval of annotations.
Batch mode failed (probably due to permission error)
Using slower (non-batch) sequential mode

So it would appear that I'm not able to create file views on the project either @allaway, that's strange.

allaway commented 3 years ago

While it's not clear to me why that is the case (are you maybe using a service account? or maybe there's some other permission setting in that project that I'm missing?), it does suggest that this issue is caused by batch vs sequential mode.

BrunoGrandePhD commented 3 years ago

I did some additional digging to see where the switch from TRUE/FALSE to True/False happens, and I've isolated the bug to the asDataFrame() method in the Synapse Python client.

Here's how I determined this. I queried all rows in this table, which includes two boolean columns (one with type boolean and one with type text). When I query this table using synapseclient, I get the problematic output with True/False under IsImportantText.

>>> import synapseclient
>>> syn = synapseclient.login()
>>> query = syn.tableQuery('SELECT * FROM syn25705259')
>>> query.asDataFrame()
                                                          id IsImportantBool IsImportantText
25614636_1_a416d0e1-d046-48f8-8343-ba3d44b25003  syn25614636            True            True
25614637_1_f75031db-a640-4203-8830-bfdf22b7df60  syn25614637             NaN             NaN
25614638_1_dacb41a1-500b-40eb-87f0-fd97dfc7d958  syn25614638           False           False
>>> query.filepath
'/Users/bgrande/.synapseCache/355/76009355/SYNAPSE_TABLE_QUERY_76009355.csv'

However, when I inspect the CSV file that was downloaded as part of the query, I can confirm that the raw data retains the TRUE/FALSE format. So, I suspect that Pandas might be interpreting the IsImportantText column as a boolean and converting the values to True/False (i.e. Python booleans).

❯ cat /Users/bgrande/.synapseCache/355/76009355/SYNAPSE_TABLE_QUERY_76009355.csv
"ROW_ID","ROW_VERSION","ROW_ETAG","id","IsImportantBool","IsImportantText"
"25614636","1","a416d0e1-d046-48f8-8343-ba3d44b25003","syn25614636","true","TRUE"
"25614637","1","f75031db-a640-4203-8830-bfdf22b7df60","syn25614637",,
"25614638","1","dacb41a1-500b-40eb-87f0-fd97dfc7d958","syn25614638","false","FALSE"

I'll report this bug on JIRA. I was hoping that I could get around this by loading the CSV myself and forcing pandas to interpret everything as strings, but I don't think it's that simple based on how long the asDataFrame() method is. I'll ask in the JIRA ticket if they have a suggested quick fix (assuming it will take them more than one month to push a new version of synapseclient with a fix).
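
For reference, the inference itself can be reproduced with plain pandas, outside of synapseclient. This is only a sketch of the suspected mechanism: it does not go through asDataFrame(), which does more than a bare read_csv() call. The CSV text is copied from the file shown above, and the variable names are illustrative.

```python
import io

import pandas as pd

# Raw CSV content, copied from the downloaded query result shown above.
raw_csv = '''"ROW_ID","ROW_VERSION","ROW_ETAG","id","IsImportantBool","IsImportantText"
"25614636","1","a416d0e1-d046-48f8-8343-ba3d44b25003","syn25614636","true","TRUE"
"25614637","1","f75031db-a640-4203-8830-bfdf22b7df60","syn25614637",,
"25614638","1","dacb41a1-500b-40eb-87f0-fd97dfc7d958","syn25614638","false","FALSE"
'''

# Default behaviour: pandas infers types, so TRUE/FALSE text is parsed as booleans.
inferred = pd.read_csv(io.StringIO(raw_csv))

# With dtype=str, type inference is skipped and the original casing is preserved.
as_text = pd.read_csv(io.StringIO(raw_csv), dtype=str)

print(inferred["IsImportantText"].tolist())
print(as_text["IsImportantText"].tolist())
```

Even with dtype=str, empty cells are still read as missing values unless keep_default_na=False is passed as well.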

@milen-sage: So, it's caused by the Synapse Python client and not the Synapse service.

milen-sage commented 3 years ago

Thanks for sleuthing, @BrunoGrandePhD. Yes, this makes sense. Hopefully the Synapse team (e.g. Jordan) has a workaround for this bug in the client.

These sorts of issues are one of the reasons we decided in schematic to treat synapse_storage_manifest.csv as the source of truth for metadata (as opposed to annotations): existing fileview schemas, and how those are interpreted downstream by clients (and the respective data frame packages in R and Python), introduce additional edge cases to handle. Once pre-existing NF annotations are ingressed via schematic, these sorts of issues will be reduced (i.e. the manifest CSVs are automatically versioned when changed and do not have the Synapse and Python client dataframe services as intermediaries). This, coupled with generating data-portal-compatible tables directly from schematic based on manifests, would ensure that metadata for downstream applications (e.g. data portal, projectLive, etc.) is handled more consistently. Once Synapse has proper support for file annotation schemas, these issues will be alleviated.

BrunoGrandePhD commented 3 years ago

@milen-sage: To be fair, we're swimming against the current by treating everything as a string, which departs from the majority of developers (i.e. the ones who are served by pandas inferring types). Unfortunately, we can't set types in CSV files, so we have to turn off type inference everywhere until we can provide types based on the data model. In fact, we should start doing this and default to the string type until we implement more sophisticated types in schematic (as we discussed in code review a few weeks ago).
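
The "default to string, convert only where the data model says so" idea could be sketched as follows. This is an assumption-laden illustration using only the standard library: `parse_bool()`, `read_typed_rows()`, and the `COLUMN_CONVERTERS` mapping are hypothetical names, and schematic's data model does not currently provide such a mapping.

```python
import csv
import io


def parse_bool(text: str) -> bool:
    """Convert 'TRUE'/'FALSE' in any casing to a Python bool."""
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"not a boolean: {text!r}")


# Hypothetical per-column converters, as a data model might declare them.
# Columns that are not listed here stay as strings.
COLUMN_CONVERTERS = {"IsImportantBool": parse_bool}


def read_typed_rows(csv_text, converters):
    """Read every cell as text and convert only the declared columns."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        typed = {}
        for column, text in record.items():
            convert = converters.get(column, str)
            # Empty cells are kept as None rather than guessed at.
            typed[column] = convert(text) if text != "" else None
        rows.append(typed)
    return rows


csv_text = '''"id","IsImportantBool","IsImportantText"
"syn25614636","true","TRUE"
"syn25614637",,
"syn25614638","false","FALSE"
'''
rows = read_typed_rows(csv_text, COLUMN_CONVERTERS)
print(rows[0])
```

Here the text column keeps its original TRUE/FALSE casing because nothing declares it to be a boolean, while the declared boolean column is converted explicitly instead of being inferred.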

By the way, I opened SYNPY-1150 to address this.

milen-sage commented 3 years ago

Just a side note: treating everything as a string (or a bit stream, for that matter) is not necessarily a bad idea for storage and communication protocols (e.g. JSON; TCP/IP); it's up to downstream applications to interpret data types based on the data model schemas/protocol standards associated with the stored data/communication packets. Given that, supporting data types explicitly in the data model schema definitions becomes a requirement that schematic needs to implement, as you point out.

Another aside: when Synapse releases JSON Schema support for annotations (about a year from now), the data types for each annotation key-value pair will be (1) validated on ingress and (2) communicated in a standard and explicit way to downstream applications/services (e.g. fileviews, pandas/R df packages, etc.), instead of relying on type inference in each application, which can be handled inconsistently across different applications. (Handling may differ both in the inferred types and in the serialization choices based on those types.)

milen-sage commented 3 years ago

Thanks for starting the issue @BrunoGrandePhD !

BrunoGrandePhD commented 3 years ago

I just realized that I pointed to the wrong asDataFrame() earlier. This method is the one that's relevant. There's still a lot happening in this method and the _csv_to_pandas_df() function that's called therein.

Accordingly, I'm going to wait until I hear back on the JIRA ticket to see whether it's worth it for us to implement a fix in the meantime.

Sage-Bionetworks / schematic

Difference in how annotations are retrieved #459