Sage-Bionetworks / schematic

Package for biomedical data model and metadata ingress management
https://schematicpy.readthedocs.io/en/latest/cli_reference.html
MIT License
22 stars 25 forks source link

Difference in how annotations are retrieved #459

Closed allaway closed 3 years ago

allaway commented 3 years ago

Describe the bug As noted in #456, @BrunoGrandePhD and @sujaypatil96 get different values when retrieving values from synapse. I get True/False, while they get TRUE/FALSE. I also noticed on my most recent try that I get Text type numbers converted to doubles. Example here: https://docs.google.com/spreadsheets/d/1WmkQU76LVORynmSC56tGfZOKXD21Uu-2ZoxAP_ZGXKE

To Reproduce This manifest can be generated by following the instructions I described in #456.

Expected behavior I expect the value to be retrieved as TRUE, rather than True. IOW, I expect Sujay/Bruno's manifest but I get mine.

Additional context

allaway commented 3 years ago

I tried this with a newer dataset and TRUE/FALSE is as expected:

(data_curator_env) ➜  schematic-upstream git:(develop) ✗ schematic manifest --config config.yml get -t "JHU Biobank RNA seq" -dt GenomicsAssay -d syn23529662 -s --use_annotations --oauth -p ~/Downloads/NF.jsonld
Starting schematic...
The (model > input > validation_schema) argument with value 'data/validation_schemas/example_validation_schema.json' is being read from the config file.
The '--json_schema' argument is being taken from configuration file (model > input > validation_schema), i.e., 'data/validation_schemas/example_validation_schema.json'.
warning: The `use_annotations` option is currently only supported when there is no manifest file for the dataset in question.
 [####################]100.00%   1/1   Done...
Downloading  [##########----------]49.45%   8.0MB/16.2MB (4.1MB/s) SYNAPSE_TABLE_QUERY_75279483.cDownloading  [####################]98.89%   16.0MB/16.2MB (5.4MB/s) SYNAPSE_TABLE_QUERY_75279483.Downloading  [####################]100.00%   16.2MB/16.2MB (5.4MB/s) SYNAPSE_TABLE_QUERY_75279483.csv.synapse_download_75279483 Done...
    [WARNING] /opt/miniconda3/envs/data_curator_env/lib/python3.9/site-packages/schematic/manifest/generator.py:877: DtypeWarning: Columns (7,27,34,36,37,38,40,42,44,45,49,50,52) have mixed types.Specify dtype option on import or set low_memory=False.
  syn_store = SynapseStorage()

warning: /opt/miniconda3/envs/data_curator_env/lib/python3.9/site-packages/schematic/manifest/generator.py:877: DtypeWarning: Columns (7,27,34,36,37,38,40,42,44,45,49,50,52) have mixed types.Specify dtype option on import or set low_memory=False.
warning:   syn_store = SynapseStorage()
Using slower (non-batch) sequential mode
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as data/json_schema_logs/json_schema_log.json
Permission Id: anyoneWithLink
Find the manifest template using this Google Sheet URL:
https://docs.google.com/spreadsheets/d/1xfn3WmJKWnmm3Jv0VfqGJLEFmFISOn5oirYZzhACynk

The only difference that I can see, other than the age of the data, is that the one that worked as expected used non-batch mode, whereas the CTFcNFWGS used the faster batch-based (fileview, i think?) mode to retrieve annotations.

BrunoGrandePhD commented 3 years ago

@allaway: I'll revisit this tomorrow when I review GitHub PRs, but I think you've hit the nail on the head. Thanks for taking the time to investigate!

@sujaypatil96 and I would be forced to use the slower non-batch method because we don't have permission on the Synapse project, whereas you were able to create the file view and use the faster batch method. Also, I wouldn't be surprised if retrieving annotations using the file view differs from using the entity annotations API. We already have a few functions to rectify the differences.

Note to self: Add a new _fix_boolean_columns() function to output TRUE and FALSE consistently, and update the unit tests to cover this edge case.

BrunoGrandePhD commented 3 years ago

I've started a draft PR (#462) to address this issue.

allaway commented 3 years ago

One note: I think Sujay does have permission to create a fileview on the project in question, because Sujay is on this team: https://www.synapse.org/#!Team:3378999 (unless @sujaypatil96 using a different service account with schematic, which might also be the case)...just wanted to mention that in case my suspicion above is wrong.

BrunoGrandePhD commented 3 years ago

@sujaypatil96: Can you reconcile @allaway's observation with my hypothesis?

sujaypatil96 commented 3 years ago

@BrunoGrandePhD @allaway: I ran the below command from here, and this is what the JHU Biobank RNA seq spreadsheet looks like.

schematic manifest --config config.yml get -t "JHU Biobank RNA seq" dt GenomicsAssay -d syn23529662 -s --use_annotations --oauth -p data/schema_org_schemas/NF.jsonld

I'm seeing the following log message after execution: Using slower (non-batch) sequential mode.

As for the CTFcNFWGS manifest, below is the command I ran, and here is the manifest.

schematic manifest --config config.yml get -t CTFcNFWGS -dt ImagingAssay -d syn4984626 -s --use_annotations --oauth -p /data/schema_org_schemas/NF.jsonld

Here are the log messages from the above command:

    Unable to create a temporary file view bound to syn4984626. Defaulting to slower iterative retrieval of annotations.
Batch mode failed (probably due to permission error)
Using slower (non-batch) sequential mode

So it would appear that I'm not able to create file views on the project either @allaway, that's strange.

allaway commented 3 years ago

While it's not clear to me why that is the case (are you maybe using a service account? or maybe there's some other permission setting in that project that I'm missing?), it does suggest that this issue is caused by batch vs sequential mode.

BrunoGrandePhD commented 3 years ago

I did some additional digging to see where the switch from TRUE/FALSE to True/False happens, and I've isolated the bug to the asDataFrame() method in the Synapse Python client.

Here's how I determined this. I queried all rows in this table, which includes two boolean columns (one with type boolean and one with type text). When I query this table using synapseclient, I get the problematic output with True/False under IsImportantText.

>>> import synapseclient
>>> syn = synapseclient.login()
>>> query = syn.tableQuery('SELECT * FROM syn25705259')
>>> query.asDataFrame()
                                                          id IsImportantBool IsImportantText
25614636_1_a416d0e1-d046-48f8-8343-ba3d44b25003  syn25614636            True            True
25614637_1_f75031db-a640-4203-8830-bfdf22b7df60  syn25614637             NaN             NaN
25614638_1_dacb41a1-500b-40eb-87f0-fd97dfc7d958  syn25614638           False           False
    >>> query.filepath
'/Users/bgrande/.synapseCache/355/76009355/SYNAPSE_TABLE_QUERY_76009355.csv'

However, when I inspect the CSV file that was downloaded as part of the query, I can confirm that the raw data retains the TRUE/FALSE format. So, I suspect that Pandas might be interpreting the IsImportantText column as a boolean and converting the values to True/False (i.e. Python booleans).

❯ cat /Users/bgrande/.synapseCache/355/76009355/SYNAPSE_TABLE_QUERY_76009355.csv
"ROW_ID","ROW_VERSION","ROW_ETAG","id","IsImportantBool","IsImportantText"
"25614636","1","a416d0e1-d046-48f8-8343-ba3d44b25003","syn25614636","true","TRUE"
"25614637","1","f75031db-a640-4203-8830-bfdf22b7df60","syn25614637",,
"25614638","1","dacb41a1-500b-40eb-87f0-fd97dfc7d958","syn25614638","false","FALSE"

I'll report this bug on JIRA. I was hoping that I could get around this by loading the CSV myself and forcing pandas to interpret everything as strings, but I don't think it's that simple based on how long the asDataFrame() method is. I'll ask in the JIRA ticket if they have a suggested quick fix (assuming it will take them more than one month to push a new version of synapseclient with a fix).

@milen-sage: So, it's caused by the Synapse Python client and not the Synapse service.

milen-sage commented 3 years ago

Thanks for sleuthing @BrunoGrandePhD. Yes, this makes sense. Hopefully the Synapse team (e.g. Jordan) have a workaround this bug in the client.

These sorts of issues are one of the reasons in schematic we decided to treat the synapse_storage_manifest.csv as the source of truth for metadata (as opposed to annotations); existing fileview schemas and how those are interpreted downstream by clients (and respective df packages in R and python) introduce additional edge cases to handle. Once pre-existing NF annotations are ingressed via schematic these sorts of issues will be reduced (i.e. the manifest csv's are both automatically versioned when changed and do not have the Synapse and python client dataframe services as intermediaries). This coupled with generating data-portal compatible tables directly from schematic, based on manifests, would ensure metadata for downstream applications (e.g. data portal, projectLive, etc) are handled more consistently. Once Synapse has proper support for file annotation schemas these issues will be alleviated.

BrunoGrandePhD commented 3 years ago

@milen-sage: To be fair, we're swimming against the current by treating everything as a string, which departs from the majority of developers (i.e. the ones who are served by pandas inferring types). Unfortunately, we can't set types in CSV files, so we have to turn off type inference everywhere until we can provide types based on the data model. In fact, we should start doing this and default to the string type until we implement more sophisticated types in schematic (as we discussed in code review a few weeks ago).

By the way, I opened SYNPY-1150 to address this.

milen-sage commented 3 years ago

Just a side note: treating everything as a string (or bit streams for that matter) is not a necessarily bad idea for storage and communication protocols (e.g. json; tcp/ip); it's up to downstream applications to interpret data types based on data model schemas/protocol standards associated with the stored data/communication packets. Given that, supporting data types explicitly in the data model schema definitions becomes a requirement that schematic needs to implement as you point out.

Another aside, when Synapse releases JSONSchema support for annotations (in a year from now), the data types for each annotation key and value pair will be

milen-sage commented 3 years ago

Thanks for starting the issue @BrunoGrandePhD !

BrunoGrandePhD commented 3 years ago

I just realized that I pointed to the wrong asDataFrame() earlier. This method is the one that's relevant. There's still a lot happening in this method and the _csv_to_pandas_df() function that's called therein.

Accordingly, I'm going to wait until I hear back on the JIRA ticket to see whether it's worth it for us to implement a fix in the meantime.