frictionlessdata / forum

🗣 Frictionless Data Forum esp for "How do I" type questions
https://frictionlessdata.io/

repeating field name for declaring generic 'wide table' layout with FD? #18

Closed proccaserra closed 4 years ago

proccaserra commented 4 years ago

The problem: I can define a Frictionless Data package for a wide table, but each such definition is effectively single-use.

The question: is there a way to write a Frictionless Data specification that allows a generic declaration of a wide table, thus promoting reuse through a more generic schema declaration?

Background: In the life sciences, so-called 'omics' techniques (sequencing, mass spectrometry, NMR spectroscopy) produce highly parallel measurements of thousands of variables (transcript, protein, or metabolite abundance) in ever larger cohorts (from dozens of study subjects to thousands). The most frequent layout for reporting such measurements is a wide table, where the first column holds the molecular identities (e.g. a gene transcript identifier) and the additional columns hold one (or more) quantitation type(s) for each subject. In a simple case, a study with 100 participants and one data acquisition per participant results in a table with 101 columns and ~25,000 rows (assuming 25,000 transcripts surveyed by the sequencing event). In the long-table layout, the same data takes 2.5 million rows.

The following suggestions are first attempts at doing so:

 "fields": [
     {
         "name": "response_variable_name",
         "title": "response variable name",
         "description": "the response variable name is meant to hold a human readable label or common name denoting the entity",
         "format": "default",
         "type": "string",
         "rdfType": "",
         "constraints": {"required": true},
         "recurring": false
     },
     {
         "name": "signal_in_acquisition_id",
         "title": "signal",
         "description": "the signal intensity recorded for the response variable in the subject, sample or specimen",
         "format": "default",
         "type": "number",
         "rdfType": "",
         "constraints": {"required": true},
         "recurring": true
     }
 ]
  1. allow for a 'pattern' to be specified for the repeating field name, to allow a regex to be provided to check field header names.

§. Known issue with this example: the semantics are unclear, as 'signal_in_acquisition_id' combines two dimensions of a data tuple (hence the suggestion of a complex field, which would allow a table with a two-row header to be represented).
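To make the 'pattern' suggestion concrete, here is a minimal sketch of how a tool might check header names against a regex for the repeating field. Nothing here is part of the Frictionless specs: the `check_headers` helper, the `signal_in_\w+` pattern, and the sample header names are all assumptions made up for illustration.

```python
import re


def check_headers(headers, fixed_names, recurring_pattern):
    """Return the headers that are neither fixed fields nor matches of the recurring pattern."""
    pattern = re.compile(recurring_pattern)
    return [
        header
        for header in headers
        if header not in fixed_names and not pattern.fullmatch(header)
    ]


# Hypothetical wide-table header: one fixed column, then one column per acquisition.
headers = ["response_variable_name", "signal_in_S001", "signal_in_S002", "notes"]
unexpected = check_headers(headers, {"response_variable_name"}, r"signal_in_\w+")
print(unexpected)
```

Any header returned by the function would be flagged as not covered by the generic declaration.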

ongoing discussion with @lwinfree and @roll

roll commented 4 years ago

@proccaserra If I got you right, it's not possible with the specs, as they require static declarations. cc @rufuspollock

Usually, we solve a problem like this by adding a layer of logic, designed for the specific use case, on top of the standards (e.g. to generate a table schema from your pattern).

rufuspollock commented 4 years ago

@proccaserra @roll could you guys explain the use case a bit more? I don't understand what a wide table is vs a long table? Are these terms defined somewhere?

proccaserra commented 4 years ago

Thx both for following up, @roll @rufuspollock. To use R / tidy data terms, I refer to the following. "Wide table", aka messy [*] (as many columns as there are conditions):

##              treatmenta treatmentb
## John Smith           NA          2
## Jane Doe             16         11
## Mary Johnson          3          1


##           name        trt result
## 1   John Smith treatmenta     NA
## 2     Jane Doe treatmenta     16
## 3 Mary Johnson treatmenta      3
## 4   John Smith treatmentb      2
## 5     Jane Doe treatmentb     11
## 6 Mary Johnson treatmentb      1

@roll, would it be possible to point me to examples of how this is done?

[*] Examples taken from: http://www.milanor.net/blog/reshape-data-r-tidyr-vs-reshape2/

rufuspollock commented 4 years ago

@proccaserra ahhhh - i would call this pivoted vs normalized / unpivoted 😄

Can you help me with the actual job stories? Is it that you want to pivot/wide for display or for compressing space or for ...? I ask as the use case really alters how we could try to support it.

roll commented 4 years ago

@proccaserra I can't recall concrete implementations relevant enough to show, but the main idea is that, using our tools, we can generate table schemas from existing data files or from patterns that make sense in your domain, e.g.:

# we implement a Schema wrapper like GeneticsSchema for the pilot
schema = GeneticsSchema.fromSchemaPattern('schema-pattern.csv')
schema.descriptor # normal FD schema

Since it's generated by a machine, it doesn't really matter how many fields it has. The above is needed if you want to store/share the data in a "wide-table" form.
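As a rough sketch of that idea, the snippet below expands one field template across the columns found in a wide file's header, using only the Python standard library. It is not the `GeneticsSchema` wrapper mentioned above (which is not shown in this thread); the `schema_from_header` helper, the template contents, and the `subject_1`/`subject_2` column names are assumptions for illustration.

```python
import copy
import csv
import io
import json

# Declaration for the first (fixed) column and a template for every repeated column.
FIXED_FIELD = {
    "name": "response_variable_name",
    "type": "string",
    "constraints": {"required": True},
}
RECURRING_TEMPLATE = {
    "title": "signal",
    "type": "number",
    "constraints": {"required": True},
}


def schema_from_header(header, fixed_field, template):
    """Build a Table Schema style descriptor: one fixed field plus one templated field per extra column."""
    fields = [copy.deepcopy(fixed_field)]
    for column in header[1:]:
        fields.append({"name": column, **copy.deepcopy(template)})
    return {"fields": fields}


# Hypothetical wide file; in practice the header would be read from the real data file.
wide_csv = "response_variable_name,subject_1,subject_2\ngene_a,1.5,2.0\n"
header = next(csv.reader(io.StringIO(wide_csv)))
schema = schema_from_header(header, FIXED_FIELD, RECURRING_TEMPLATE)
print(json.dumps(schema, indent=2))
```

The template is written once and reused for any number of subjects, while the generated descriptor remains an ordinary static schema.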

If it's ok to store/share data in a "long-table" form there are data processing tools like dataflows to unpivot data before publishing.
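For readers unfamiliar with unpivoting, here is a minimal plain-Python sketch of the wide-to-long reshaping, using the example table from earlier in this thread. It deliberately does not use the dataflows API; the `unpivot` function below is a hypothetical helper written for illustration only.

```python
def unpivot(rows, key, variable_name, value_name):
    """Turn wide rows (one column per condition) into long rows (one row per key/condition pair)."""
    columns = [column for column in rows[0] if column != key]
    long_rows = []
    for column in columns:
        for row in rows:
            long_rows.append(
                {key: row[key], variable_name: column, value_name: row[column]}
            )
    return long_rows


# The "wide table" example from this thread; None stands in for R's NA.
wide = [
    {"name": "John Smith", "treatmenta": None, "treatmentb": 2},
    {"name": "Jane Doe", "treatmenta": 16, "treatmentb": 11},
    {"name": "Mary Johnson", "treatmenta": 3, "treatmentb": 1},
]
long_rows = unpivot(wide, "name", "trt", "result")
for row in long_rows:
    print(row)
```

The long form always has the same three fields, so a single static table schema covers it regardless of how many treatments or subjects there are.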

proccaserra commented 4 years ago

@rufuspollock job story 1: using the ggplot2 R library, I run various plotting options. The library requires a normalized/long/tidy table as input. This layout information is not available from input files received from external sources unless they are parsed. Having an FD table description greatly improves input quality, but a metadata element indicating the "layout" of the matrix would help further.

job story 2: when rendering tabular data to users (for quick exploration), the unnormalized/wide table orientation is preferred (this layout is also the one most commonly generated by analysis tools).

I guess my main 'constraint' was: can I define a single FD tabular package definition for a known pattern? But @roll's solution would definitely work. I was wondering if the JSON declaration could reference the pattern mentioned.

The question probably morphed into supporting FD json generation during matrix reshaping. Thx both.

rufuspollock commented 4 years ago

@proccaserra for job story 1 if the data is in long form and the table schema is for that you're fine right?

For job story 2 i think (?) you just want a pivot option for transforming the table schema? Is that right?
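A minimal sketch of what such a pivot step could look like on the data side, assuming the long-form rows are already in memory. The `pivot` function is a hypothetical helper, not an existing Frictionless tool option.

```python
def pivot(long_rows, key, variable_name, value_name):
    """Turn long rows into wide rows: one output row per key, one column per variable value."""
    wide = {}
    for row in long_rows:
        # Create the wide row on first sight of the key, then add the variable as a column.
        wide_row = wide.setdefault(row[key], {key: row[key]})
        wide_row[row[variable_name]] = row[value_name]
    return list(wide.values())


# Two subjects from the "long table" example in this thread; None stands in for R's NA.
long_rows = [
    {"name": "John Smith", "trt": "treatmenta", "result": None},
    {"name": "Jane Doe", "trt": "treatmenta", "result": 16},
    {"name": "John Smith", "trt": "treatmentb", "result": 2},
    {"name": "Jane Doe", "trt": "treatmentb", "result": 11},
]
wide_rows = pivot(long_rows, "name", "trt", "result")
for row in wide_rows:
    print(row)
```

With a step like this, the data can be stored and described in long form and pivoted only for display.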

proccaserra commented 4 years ago

@rufuspollock, job story 1 is indeed covered and I can produce a 'general purpose' table definition, but then users aren't 100% won over.

job story2: I'd say so with the added question regarding feasibility: 10 subjects would mean 10 new fields in the JSON file describing the table. I was wondering about the possibility of declaring a field pattern, with an attribute 'number of occurrence:', 10 in our case. It would have the advantage of 1. keeping the JSON file small, 2. allowing to define a template once and reuse/repurpose it. The drawbacks could be that of adding complexity to the existing model.

rufuspollock commented 4 years ago

@proccaserra

but then users aren't 100% won over

Can you say a bit more about that. In what way are they not won over?

job story 2: ... I was wondering about the possibility of declaring a field pattern, with an attribute 'number of occurrence:', 10 in our case. It would have the advantage of 1. keeping the JSON file small, 2. allowing to define a template once and reuse/repurpose it. The drawbacks could be that of adding complexity to the existing model.

Can you spec out a bit how you imagine your field pattern working e.g. the addition to the spec and then the way the tooling would use that. It would help me get a better feel for how this would work and if we could implement 😄

proccaserra commented 4 years ago

@rufuspollock

rufuspollock commented 4 years ago

@proccaserra great - and on spec'ing it out it can be very simple e.g.:

... content
number_of_occurences: ...
...

Description of how a tool would use this ...

lwinfree commented 4 years ago

Hey @rufuspollock thanks for helping here! I'm wondering if you've checked out Phil's proposed specs in the 1st comment? Would having a repeated element/pattern for the fields be allowed in the specs?

rufuspollock commented 4 years ago

@lwinfree i did and good to flag as initially i was just trying to understand the use case. I also didn't really understand the spec e.g. why "recurring" so my request for a simple spec plus how exactly a tool would use this would really help.

roll commented 4 years ago

I think it's more for the software to handle.

E.g. the specs don't have skipRows, but it's implemented everywhere from tabulator to dataflows. It's an additional API provided by the implementations, which keeps the specs stable and generic.

proccaserra commented 4 years ago

+1 @roll. Further discussions with @lwinfree and @roll clarified the functionality available from existing tools, as well as specific points in the specifications. Based on these discussions, I guess this issue can be closed. However, it seems that a new need has emerged: it has to do with the rdfType attribute, which currently accepts only one value.

lwinfree commented 4 years ago

@proccaserra if you don't mind, could you write up the rdfType issue so we can discuss it too? Thx!

proccaserra commented 4 years ago

@lwinfree, sure. I'll start the discussion here, but it can be moved to a dedicated issue if deemed necessary.

In the case of complex headers (as made available by the tabulator function), a more complex expression may be needed, and I was wondering about the use of JSON-LD context files in conjunction with Frictionless Data (I am aware of existing issues on the tracker).

proccaserra commented 4 years ago

@lwinfree to qualify further:

With a field such as condition1.auc, setting the rdfType to "http://purl.obolibrary.org/obo/STATO_0000209" for area under curve tells only part of the story. But before making a request for change on the rdfType property, we need to further refine the scenario behind the context for a transformation into RDF triples.

rufuspollock commented 4 years ago

@proccaserra could you open a separate issue about multi-valued rdfType? I'd suggest doing that in the specs tracker: https://github.com/frictionlessdata/specs/issues

rufuspollock commented 4 years ago

FIXED. It looks like we have a resolution for original need from existing tools.