Sage-Bionetworks / schematic

Package for biomedical data model and metadata ingress management
https://schematicpy.readthedocs.io/en/stable/cli_reference.html
MIT License
22 stars 25 forks source link

Spaces removed from manifest column names when using `model/submit` endpoint #725

Closed lakikowolfe closed 2 years ago

lakikowolfe commented 2 years ago

Describe the bug When I submit a manifest using the model/submit endpoint the column names (ex: "Col Name") are modified to have no spaces (ex: "ColName").

To Reproduce Steps to reproduce the behavior:

  1. Download test manifest
  2. Modify to match new schema or find older schema (Age Is Obfuscated needs to be "True"/"False", HTAN Patient ID -> HTAN Participant ID)
  3. Save as csv
  4. Submit through model/submit endpoint

Expected behavior I expect that there would be no modification to the column names

Priority (select one)

Screenshots NA

Desktop (if applicable, please complete the following information):

Additional context NA

lakikowolfe commented 2 years ago

@milen-sage I didn't add a priority to this because I wasn't sure which to choose.

@ychae Milen told me to tag you in this in case you need to communicate with some stakeholders to hold off updating production systems

milen-sage commented 2 years ago

I'd rate this issue somewhere between major and critical. @lakikowolfe since you caught it before the bug's in production, let's mark it as a major issue for now.

However, this bug blocks releasing or using schematic code in production, so in days it could escalate to a critical issue. @ychae we'd need to coordinate with stakeholders to be on the lookout and don't push schematic's develop branch in production.

GiaJordan commented 2 years ago

@milen-sage From what I've found so far, the column renaming happens in upload_format_manifest_table, and then later in the submission process it gets saved: uplodad_manifest_file, this is the version that gets uploaded to synapse

GiaJordan commented 2 years ago

@ychae The renaming of manifest columns can be bypassed by setting -mrt to entity, confirmed by @lakikowolfe Will discuss further with @milen-sage and @mialy-defelice

GiaJordan commented 2 years ago

AC per slack thread w/ @milen-sage

I see, so the expected behavior is that:

  • in all cases, the csv manifest columns are not changed by default
  • if table or both the Synapse table column names can be changed, especially if Synapse doesn’t support spaces in the table columns
  • the latter is similar to how we handle annotation keys: we change the annotation keys, but we don’t change the manifest columns
  • a separate flag for changing column names, that by default preserves the column names, but can only be used in conjunction with table or both to rename the column
GiaJordan commented 2 years ago

It seems like there is already a boolean flag --use_schema_label/--use_display_label that have similar functionality, renaming the manifest attributes, but for annotation purposes. I can extend the functionality of this flag instead of adding a new one if it wouldn't be too confusing, but currently the default behavior is to change the names, which would conflict with the desired default behavior to not change the manifest column names.

@milen-sage @mialy-defelice

milen-sage commented 2 years ago

NM #763 is its own bug.