CONP-PCNO / conp-dataset

:open_file_folder: A DataLad dataset for CONP
http://conp.ca
MIT License

Derived BigBrain datasets #230

Closed glatard closed 4 years ago

glatard commented 4 years ago

Purpose

Integrate derived BigBrain datasets in the portal, as a first use-case of derived data support. Make sure that derived datasets can be linked to their parents (and conversely), in the portal and in DataLad.

Context

Derived datasets are an important use case. BigBrain is a good case to start from, as there are a lot of derivatives, and it is an open dataset.

Possible Implementation

Start with https://portal.conp.ca/dataset?id=projects/Khanlab/BigBrainMRICoreg, as it is already in CONP.

Proposed work-plan:

  1. Look at https://portal.conp.ca/dataset?id=projects/Khanlab/BigBrainMRICoreg (derived dataset), find out which BigBrain files it has been derived from (parent dataset).
  2. Add the parent dataset to CONP if it is not https://portal.conp.ca/dataset?id=projects/bigbrain-datalad
  3. Add the parent DataLad dataset as sub-dataset of the derived DataLad dataset (make a PR to https://portal.conp.ca/dataset?id=projects/Khanlab/BigBrainMRICoreg)
  4. Check with DATS schema developers (@emmetaobrien and @zxenia) how to represent parent/derived datasets in DATS.
  5. Check with portal developers (@liamocn, @xlecours) how to represent derived data in the portal
  6. Repeat 1, 2, 3, with other derived datasets.
  7. Yay!
  8. Oh, and add documentation on how to add derived datasets to CONP :)

@cmadjar, @akhanf, @samirdas

akhanf commented 4 years ago

@Martybird: does the current CONP portal have the most up-to-date data? As I recall, the reviewers asked for the data to be uploaded to OpenNeuro, so the dataset there may be more up to date.

Martybird commented 4 years ago

The versions on OpenNeuro and OSF are indeed the most up to date.

cmadjar commented 4 years ago

@akhanf Any chance you could update the dataset on CONP so that we have the latest version?

I will investigate on my end how we can link the raw and processed datasets on CONP and datalad.

Thank you!

cmadjar commented 4 years ago

@akhanf actually, if the dataset that was used to produce the registration outputs is different from the BigBrain dataset hosted on CONP, it would be best to create a separate CONP dataset with the raw data that was used to produce your results.

If there is such a need, would you be able to create that dataset for the CONP portal?

Let me know. Thank you.

kaitj commented 4 years ago

@cmadjar @Martybird @akhanf - Dataset has been updated at the source (khanlab-datasets) and a PR has been opened to merge into conpdatasets (https://github.com/conpdatasets/BigBrainMRICoreg/pull/2)

Martybird commented 4 years ago

Thank you, Jason!

cmadjar commented 4 years ago

@akhanf @Martybird Your PR just got merged today, meaning the BigBrainMRICoreg dataset is up to date for CONP.

Should the processed data in BigBrainMRICoreg be linked to the native data stored in the CONP BigBrain dataset or did you use different raw files to produce the processed data?

Let me know. Thank you!

Martybird commented 4 years ago

Hi @cmadjar ,

The resulting data were all processed with MINC2 format of the BigBrain dataset, and then converted to NIFTI. The final repository contains both data types, but the spatial transformations are only available in MINC format.

cmadjar commented 4 years ago

Hi @Martybird,

Thank you for the quick reply.

So if I understand correctly, we could link the BigBrain dataset with the BigBrainMRICoreg dataset on the CONP portal as raw and derived datasets respectively?

Thank you!

Martybird commented 4 years ago

@cmadjar Yes.

cmadjar commented 4 years ago

great! Thank you so much!

glatard commented 4 years ago

@Martybird did you use the version with 125 40-um blocks?

Martybird commented 4 years ago

@glatard No, I didn't. I used the previous ICBM registration volume by Claude at 100um and 300um.

glatard commented 4 years ago

So BigBrainMRICoreg isn't derived from the BigBrain dataset we have on CONP. Do you know if the volume you used is available online anywhere?

Martybird commented 4 years ago

@glatard I haven't looked closely at the BigBrain dataset on CONP. The volumes are available at the official BigBrain website: https://bigbrain.loris.ca/main.php?test_name=brainvolumes&release=2015

glatard commented 4 years ago

Thanks to all who contributed, correct me if this is not complete but I think there are now:

We should now think of representing relation "is derived from" in the datasets and portal. The main use-case for that would be for users to be able to list the dataset(s) derived from a given parent dataset, or to list parent dataset(s) for a given derived dataset.

DataLad already has a mechanism for that, through sub-datasets: parents of a derived dataset should be added as sub-datasets of the derived dataset. In this way, derived datasets can "declare" the parents that were used, without having to update the parent datasets themselves.

So I think the next steps on this topic should be:

  1. Add parent dataset(s) to all datasets derived from BigBrain. I guess this is for @emmetaobrien
  2. Specify a graphical way in the portal to list the parent and derived datasets of a given dataset. An easy way would be to just add links to the data page; a better way might be to represent the datasets as "threads" (as in email threads) in the front data page. @3design could be involved here if he has time.

About point 2, it should be noted that while listing parents would be straightforward, listing the derived datasets of a given parent requires going through all the datasets in the platform. This should probably be done asynchronously (using a cron job), to keep the interface responsive.
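The asymmetry described in this comment can be sketched roughly as follows: a dataset's parents are read from its own record, whereas its children only appear after scanning every dataset, which is why a precomputed reverse index is attractive. This is a minimal sketch, not the portal's implementation; the `parents_by_dataset` mapping, the function name, and the dataset names in it are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input: each dataset declares its own parents
# (for example, as read from its sub-datasets). Names are illustrative.
parents_by_dataset = {
    "BigBrain_3DSurfaces": ["bigbrain-datalad"],
    "BigBrain_3DClassifiedVolumes": ["bigbrain-datalad"],
    "bigbrain-datalad": [],
}


def build_children_index(parents_by_dataset):
    """Invert the parent declarations into a parent -> children mapping.

    This walks every dataset once, so it is the part worth running
    periodically (e.g. from a cron job) rather than on each page load.
    """
    children = defaultdict(list)
    for dataset, parents in parents_by_dataset.items():
        for parent in parents:
            children[parent].append(dataset)
    # Sort the children for a stable display order.
    return {parent: sorted(kids) for parent, kids in children.items()}


children_index = build_children_index(parents_by_dataset)
```

With such an index stored by the periodic job, looking up the children of a parent at page-load time becomes a single dictionary access.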

glatard commented 4 years ago

Tagging @jbpoline as this comes from a discussion we recently had together.

emmetaobrien commented 4 years ago

@glatard: Your summary above is accurate.

There is nothing in DATS to specifically record an "isDerivedFrom" relationship; we would have to set that up ourselves under extraProperties, but that would be straightforward to do. Alternatively, the "YODA" data organisation principles mentioned in the DataLad handbook involve storing "isDerivedFrom" implicitly, in structures where a source dataset is always linked as a submodule of a derived dataset; that looks like it might be a bit fiddly to set up, but would only get more so with time.

glatard commented 4 years ago

Hi @emmetaobrien, yes, the YODA recommendation is what I thought we should do in point (1) of my post above (I didn't know it was part of YODA, thanks!). At this stage, I'm not sure if we should add this information in the DATS model, duplicating information in a possibly inconsistent way, or just rely on sub-datasets to store the derivation relation.

emmetaobrien commented 4 years ago

I lean slightly to storing the information in DATS because it would be more explicit and not require users to be familiar with YODA, but am open to arguments either way.

glatard commented 4 years ago

I think we should encourage putting parent datasets as sub-datasets no matter what, as it makes it way easier to access the parent data to understand and possibly reproduce the dataset. I'd say the question is whether we want to represent that in the DATS model in addition.

glatard commented 4 years ago

Discussion from the meeting today: we will use both YODA and a property in the DATS model. The DATS model will include URLs of the Git submodule(s) found in the dataset. @emmetaobrien will create an example for that.
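As a rough illustration of the decision recorded here, the property could be built from the submodule URLs along these lines. This sketch assumes the category/values shape that is quoted from a DATS.json file later in this thread; the helper name and the parent URL are hypothetical.

```python
import json


def derived_from_property(submodule_urls):
    """Build a "derivedFrom" entry from the parent submodule URLs.

    Assumption: the entry uses the category/values layout shown in the
    DATS.json excerpt later in this thread.
    """
    return {
        "category": "derivedFrom",
        "values": [{"value": url} for url in submodule_urls],
    }


# Hypothetical parent URL, for illustration only.
entry = derived_from_property(["https://github.com/conpdatasets/example-parent"])
print(json.dumps(entry, indent=2))
```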

emmetaobrien commented 4 years ago

There is now an example at https://github.com/conpdatasets/BigBrain_3DSurfaces, which has both the raw BigBrain dataset as a submodule, and the field extraProperties->derivedFrom containing a link to the BigBrain dataset.

glatard commented 4 years ago

Awesome! Does it require any update in the DATS schema? I think we could then PR that to the main dataset, and pass it on to @xlecours, @liamocn and @3design to represent this information in the portal.

emmetaobrien commented 4 years ago

If this seems OK to everyone, I will update the documentation of our DATS schema accordingly.

I have also just done an equivalent update to https://github.com/conpdatasets/BigBrain_3DClassifiedVolumes

emmetaobrien commented 4 years ago

The documentation has now been updated. https://github.com/CONP-PCNO/conp-documentation/pull/32

glatard commented 4 years ago

Thanks for all this @emmetaobrien! I guess this is now ready for a PR to conp-datasets? Derived information would then be ready to be displayed in the portal.

emmetaobrien commented 4 years ago

https://github.com/CONP-PCNO/conp-dataset/pull/275 is now ready for review.

glatard commented 4 years ago

Thanks @emmetaobrien for all the work in #275. The next step now would be to represent links between derived datasets and their parents in the portal. An easy way to do so would be to just have a list of parent datasets and a list of child datasets in the dataset page. Maybe we could also add an icon on the front page to show that a dataset has x children.

I think we need @3design to chime in here and let us know how this information could be represented.

3design commented 4 years ago

This is quite interesting. Is there a way we can show it in a kind of relationship flow chart (reminds me of a file browser structure)?

A typical relationship chart may be helpful:

    parent 1
    L child 1
      L child 2 (current dataset)

Of course, it gets a bit more complex if you need to show relationships where multiple datasets are used for the derived:

    parent 1, parent 2, parent 3
    L child 1, parent 2
      L child 2 (current dataset)

I think the above would be acceptable, especially if each dataset is clickable so that they can be found in the portal or linked outside.

It would be nice if it could be shown in a more dynamic way such as connected nodes (though this might be well beyond our current scope):

https://dist.neo4j.com/wp-content/uploads/20160202125930/cypher-query-data-relationships-nicole-white-graphconnect.png

https://wp-assets.highcharts.com/www-highcharts-com/blog/wp-content/uploads/2019/12/10175015/Which-charts-are-best-at-showing-data-relationships-2.jpg

glatard commented 4 years ago

Indeed, a "file browser" structure might not work when a dataset has multiple parents. I had thought about an "email thread" kind of representation earlier but it has the same issue (derived datasets are a graph, not a tree). Your idea to have a more visual representation is awesome, I think it would be very impressive to give such a (clickable) high-level overview of relations between datasets. Here is another example of a possible visualization: https://brain-web.github.io/community/
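To make the "graph, not a tree" point concrete, here is a minimal sketch that collects all ancestors of a dataset when datasets may have several parents. The `parents_of` mapping reuses the placeholder names from the relationship chart above and is purely illustrative.

```python
# Hypothetical derivation records: dataset -> list of direct parents.
# "child 2" has two parents, which a single tree branch cannot express.
parents_of = {
    "child 2": ["child 1", "parent 2"],
    "child 1": ["parent 1", "parent 2", "parent 3"],
    "parent 1": [],
    "parent 2": [],
    "parent 3": [],
}


def ancestors(dataset, parents_of):
    """Collect every dataset reachable by following parent links."""
    seen = set()
    stack = list(parents_of.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already reached through another path
        seen.add(current)
        stack.extend(parents_of.get(current, []))
    return seen


all_ancestors = ancestors("child 2", parents_of)
```

Here "parent 2" is reached through two paths, which is the case a file-browser or email-thread layout would have to duplicate.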

Maybe we could break this down into the following 2 steps to make it more tractable?

  1. Implement a simple "icon and HTML links" representation in the portal
  2. Go for an (interactive) graph representation

What do you think?

3design commented 4 years ago

The visual representation in your link is very nice. I think this would be very impressive as long as it's also still usable and valuable for the user.

I think your steps would be good. A basic view and a more relationship-based view. (we can give the user a choice to load the relationship view)

glatard commented 4 years ago

OK! So what do you think the design of step 1 could look like?

3design commented 4 years ago

I've been looking at https://fairsharing.org and they are using a 'chip' based system for their tags, etc.

https://fairsharing.org/FAIRsharing.jptb1m

maybe this can be adapted to our use:

Parent datasets: Chip 1, Chip 2, Chip 3

This doesn't show the whole history, but it does show one step up (parents).

What are your thoughts? Is it enough to show the parents and have them clickable so that the user will at least see what info is in the parent? Over time as our DB grows, this will become more meaningful as each parent will link to other datasets in our system (as well as outside).

As a by-product, these traced connections may also be usable when generating the interactive graph.

glatard commented 4 years ago

Hi @3design, this sounds very good to me for a first step! I guess this could now be sent to @liamocn to implement? @liamocn, to summarize the discussion above, the goal would be to:

cmadjar commented 4 years ago

@emmetaobrien @3design @liamocn Can you guys work together to make that happen on the frontend before the end of June 2020?

I think there is a plan for @liamocn to start implementing on the portal.

Thank you!

liamocn commented 4 years ago

@glatard @emmetaobrien Currently the value of the parent dataset is a URL. I can use this to create a link, but I have no way of labelling the link other than with the URL itself, as I can't reliably identify what the parent dataset actually is. Also, this link will be an external one and will not link to that dataset on the portal, for the same reason. Is all this intended?

cmadjar commented 4 years ago

Example of the content of the DATS.json file:

    {
      "category": "derivedFrom",
      "values": [
        {
          "value": "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48"
        }
      ] 
    }

It is nice to be able to track the exact commit that the dataset is derived from, but maybe we could also add the dataset name so that Liam could link those in the portal?

Maybe something like this?

    {
      "category": "derivedFrom",
      "values": [
        {
          "parent_dataset_id": "preventad-open",
          "value": "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48"
        }
      ] 
    }

@liamocn if there was a "parent_dataset_id" or "parent_dataset_name", would that work for you? @glatard @emmetaobrien thoughts? Other suggestions?
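For illustration, a consumer of the proposed entry could read it roughly like this. This is a sketch under assumptions: the surrounding record (including its "title") is hypothetical, "derivedFrom" is assumed to sit in the "extraProperties" list, and only the entry itself mirrors the JSON proposed above.

```python
import json

# A trimmed, hypothetical DATS.json record; only the "derivedFrom" entry
# mirrors the shape proposed above.
dats_text = """
{
  "title": "example-derived-dataset",
  "extraProperties": [
    {
      "category": "derivedFrom",
      "values": [
        {
          "parent_dataset_id": "preventad-open",
          "value": "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48"
        }
      ]
    }
  ]
}
"""


def derived_from(dats):
    """Return (parent_dataset_id, url) pairs declared under "derivedFrom".

    parent_dataset_id is None when an entry only carries the URL.
    """
    parents = []
    for prop in dats.get("extraProperties", []):
        if prop.get("category") != "derivedFrom":
            continue
        for entry in prop.get("values", []):
            parents.append((entry.get("parent_dataset_id"), entry.get("value")))
    return parents


parents = derived_from(json.loads(dats_text))
```

Returning `None` for a missing identifier keeps older entries, which only carry the URL, readable by the same code.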

liamocn commented 4 years ago

Yes, that's sort of what I was thinking of. This also helps with identifying child datasets. Of course, if I can assume that all parent datasets are going to have URLs of the form https://github.com/conpdatasets/:id/... then I can slice the id out, but that will break if the assumption isn't true for all cases.
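The URL-slicing fallback mentioned here might look like the sketch below. The function name and the `org` parameter are assumptions for illustration; the function returns `None` whenever the URL does not match the assumed form, so the caller can tell that the assumption failed instead of getting a wrong id.

```python
from urllib.parse import urlparse


def dataset_id_from_url(url, org="conpdatasets"):
    """Return the ':id' segment of https://github.com/<org>/:id/..., else None."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "github.com" or len(parts) < 2 or parts[0] != org:
        return None  # the URL does not follow the assumed form
    return parts[1]


parent_id = dataset_id_from_url(
    "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48"
)
external = dataset_id_from_url(
    "https://bigbrain.loris.ca/main.php?test_name=brainvolumes&release=2015"
)
```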

emmetaobrien commented 4 years ago

@cmadjar: Your suggestion sounds very sensible to me. What's the most straightforward value to include there for lookup? Would just the dataset name suffice?

cmadjar commented 4 years ago

@emmetaobrien what is being parsed by the portal to populate the database is the name of the submodule in .gitmodules.

https://github.com/CONP-PCNO/conp-dataset/blob/master/.gitmodules

So for PREVENT-AD BIDS, the parent dataset name would be preventad-open, but for a dataset derived from BigBrain, one would need to add bigbrain-datalad as the parent dataset name.

I can send a PR to add that for the PREVENT-AD BIDS dataset. Maybe @emmetaobrien you can take care of the other ones?
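As a rough sketch of the lookup described in this comment, submodule names can be pulled out of a .gitmodules file with a few lines of Python. The excerpt below is hypothetical and not copied from the real file, and treating the last path component of the submodule name as the dataset name is an assumption, not something this thread confirms.

```python
import re

# Hypothetical excerpt in the usual .gitmodules layout; the paths and
# URLs are illustrative and not copied from the real file.
gitmodules_text = '''
[submodule "projects/preventad-open"]
    path = projects/preventad-open
    url = https://github.com/conpdatasets/preventad-open.git
[submodule "projects/bigbrain-datalad"]
    path = projects/bigbrain-datalad
    url = https://github.com/conpdatasets/bigbrain-datalad.git
'''

SECTION = re.compile(r'^\[submodule "(?P<name>[^"]+)"\]\s*$')


def submodule_names(text):
    """Return the names declared in [submodule "..."] section headers."""
    names = []
    for line in text.splitlines():
        match = SECTION.match(line.strip())
        if match:
            names.append(match.group("name"))
    return names


names = submodule_names(gitmodules_text)
# Assumption: the dataset name is the last path component of the name.
dataset_names = [name.rsplit("/", 1)[-1] for name in names]
```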

cmadjar commented 4 years ago

@liamocn the DATS.json of preventad-open-bids has been updated with the new key so that you can work on the portal design again. Thank you!

emmetaobrien commented 4 years ago

@cmadjar: I have just updated the BigBrain derived datasets accordingly.

cmadjar commented 4 years ago

I pulled the code from @liamocn on the portal and we can now see the child datasets and the parent datasets on the appropriate datasets.

@glatard @emmetaobrien is there anything else I am not seeing that needs to be done for this issue? Thank you!