elixir-europe / plant-brapi-to-isa

BSD 3-Clause "New" or "Revised" License
8 stars 6 forks source link

Improve performances by having one assay by level #70

Closed cpommier closed 1 year ago

cpommier commented 3 years ago

The current bottelneck is the number of assays. There is curently one assy per observationunit. MIAPPE allows to have one assay per level. By handling only one observation unit per level in brapi_to_isa line 95 : def create_study_sample_and_assaythe performances are dramatically improved. But, the germplasm list isn't present in the study file anymore.

What we don't understand, is that the germplasms should be handled by the following piece of code, line 534:

                for germ in germplasms:
                    # Creating corresponding ISA biosources with is Creating isa characteristics from germplasm attributes.
                    # ------------------------------------------------------
                    source = Source(name=germ['germplasmName'], characteristics=converter.create_germplasm_chars(germ))

                    if germ['germplasmDbId'] not in germplasminfo:
                        germplasminfo[germ['germplasmDbId']] = [germ['accessionNumber']]

                    # Associating ISA sources to ISA isa_study object
                    isa_study.sources.append(source)

What are we suppose to do to ensure that the germplams end in the right study charachteristics ?

cpommier commented 3 years ago

@bedroesb and @proccaserra , what is your opinion on this ?

bedroesb commented 3 years ago

@cpommier Could you send the changes you propose in def create_study_sample_and_assay so I can have a look how to fix it? Because to be honest if I look now at the code it is difficult to see immediately why certain things were done (in a hacky way) :)

proccaserra commented 3 years ago

@cpommier as far as I remember (It is already a long time ago), we had the discussion about the which 'observation level' to support from BRAPI to ISA. So we need to go back to the mapping between BRAPI and ISA entities. I am confused by the discrepancy in Line numbers you reference, but I guess we are talking about the code which tries to create ISA Source and Sample based on BRAPI information obtained from GermplasmID associated with ObservationUnits , and then do a lookup on the characteristics (invoking create_germplasm_chars(germ) ). It seems the refactoring now fails to retrieve the information. But a call would be best to discuss this because too much time has past since I worked on this (I am not working with BRAPI every day and my memory needs to be jogged to get the BRAPI model back!)

cpommier commented 3 years ago

Thanks @bedroesb and @proccaserra ! The changes I have in mind are not implemented yet, I am still cautious, because they might violate the isatools model. Therefore I am sorry but there is nothing @erikkimmel or me could send you :(

A quick call might indeed be better, I'll send you both an email and we will see what we can do, thanks!

Curently, the code works, but is not too slow. The curent implementation pseudo code is the following

for germ in  germplasm:
    study.source.add(germ.characteristics)

create_study_sample_and_assay (study, observation_unit)

def create_study_sample_and_assay(study, observation_unit):
  for obs_unit in  observation_unit:
      study.assays.add(create_sample_from_obs_unit (obs_unit))

I hope I am not oversimplyfing things. The problem is that only the sources that are attached to an assay are added in the study file. I would like to dump all the sources even if they are not explicitely used in an assay. Is that possible @proccaserra or did I misunderstood isatools logic ? What I would like to do is more in the line of :

for germ in  germplasm:
    study.source.add(germ.characteristics)

create_study_sample_and_assay (study, observation_unit)

def create_study_sample_and_assay(study, observation_units):

     # the following shoudl'nt be attached to any specific source. Or more precisely, it should be attached to all of them.
      study.assays.add(create_one_assay_per_level) # the level is plant, micro plot, block, etc...
      create_data_file(observation_units)

Does that make sense ?

proccaserra commented 3 years ago

@cpommier, I ran a quick test on a notebook with the isa-api. If an isa.Source is declared but not used in any protocol to create another materials, even though the isa.Source and associated to a Study, the serialisation to ISA-tab does not write that source. The ISA-JSON serialisation is fine shows the unused Source. So it seems we need to look at the ISA-Tab serialisation code. @djcomlab @zigur

cpommier commented 3 years ago

Thanks a lot @proccaserra

djcomlab commented 3 years ago

So this is kind of expected behaviour and down to design decisions in ISA API.

The ISA-Tab tables serialization focuses on describing the experimental workflow, and if there's a source declared that is not used in a workflow, then the assumption is has been that it has not used in the experiment and therefore not rendered in a table.

ISA-JSON, however, has no such constraints as the data model has the structures to cleanly hold sources independent of the experimental workflow.

cpommier commented 3 years ago

Thanks for the clarification @djcomlab . I understand the logic. Would it be possible to trigger the ISA-JSON beaviour in ISA-Tab dumps using a dedicated option ?

djcomlab commented 3 years ago

No, I don't think this would work as things are at the moment.

The problem is that in ISA-Tab there is no separate "container" for material nodes (Source, Sample, other stuff) to exist independently from the experimental workflows.

It is of course possible to just add sources into a study table file with no linked processes and derived materials or data etc. but in ISA API the design assumption has been that this would not be correct, and I suspect this raises errors at validation time (disclaimer: but I have not tried it!).

cpommier commented 3 years ago

Ok, but would it be possible to attach all the sources to a single assay ?

cpommier commented 3 years ago

In https://github.com/elixir-europe/plant-brapi-to-isa/blob/master/brapi_to_isa.py lines 103 and following, we create one sample attached to one source, that gets attached to the study line 154

 isa_study.samples.append(this_isa_sample)

Then it get attached to a dedicated assay again line 168: isa_study.assays[i].samples.append(this_isa_sample)

If we have only ` isa_study.samples.append(this_isa_sample) and create a single assay, woud that work ? @erikkimmel, what do you think ?

djcomlab commented 3 years ago

For the sources and samples to appear in the tables they must be associated with a process.

e.g. at the study-level, source -> process -> sample must be described for it to appear in the study table.

Similarly to the sources in the sources container in the model and ISA-JSON not coming out in the ISA-Tab study tables, samples must also be part of an experimental workflow (i.e. part of a process sequence) to appear in the assay table files. @proccaserra can probably elaborate on this.

cpommier commented 3 years ago

Thanks, that clarifies the choices that were made

proccaserra commented 3 years ago

@cpommier so I ran a few more test this morning, writing up variations of ISA objects and now confirm that:

I'll document all that in a jupyter-notebook

proccaserra commented 3 years ago

@cpommier a suggestion: if brapi2isa code can identify 'unused Sources', then a possible solution would be to create "dummy" Samples so the ISA-Tab serialisation dumps the ISA.Source Information while clearly marking the ISA.Sample as "dummies". This could be done using a special string (e.g. string as value for the Sample Name), complemented by the addition of an ISA.Comment element qualifying said ISA.Sample object to make it clear to human agent. Of course, this is 'sugar' and other tools supporting the format wouldn't know how to deal with that but it would possibly address your needs.

djcomlab commented 3 years ago

A question:

Are the source->sample, and sample->data workflows basically "flat" in all cases here?

From what I can see in the code it looks like there is no branching or pooling. The ISA API's ability to deal with complex graphs is what slows it down.

proccaserra commented 3 years ago

@djcomlab, based on the set of converted brapi studies, the graph complexity is very low. The amount of information in the ISA assay table is actually very low.

djcomlab commented 3 years ago

@proccaserra That's what I figured. I could probably put together a fast custom ISA-Tab writer just for this, until we address the overarching ISA-API performance. The idea being that ISA API is more like the Swiss Army knife that we can use to do many things with, but not necessarily all of these things optimally in all cases, but when we want to do one specific kind of task on a larger scale, we need to craft something optimised the task.

cpommier commented 3 years ago

@proccaserra @djcomlab , thanks! We had the same idea @erikkimmel and I and even tried an implementation, but it prooved to be too much work for us since we don't know the isatool library as well as you. If you want to do a fast custom ISA-Tab writer we would be very happy to test it, either on the brapi2isa master branch or on our tentative optimisation branch (https://github.com/elixir-europe/plant-brapi-to-isa/tree/test/single_assay_by_level). In parralel, we could try to review that branch with @bedroesb . What do you think, should we have a chat before begining ?

Anyway, thanks a lot for proposing this optimised writer, it would be highly usefull!

erikkimmel commented 3 years ago

Good morning all, With the code we modified with @cpommier on BrAPI2ISA in this branch, I ran some performance tests. On a medium-sized dataset, I obtained a gain of around 25% (4h -> 3h). Have you had time to look for the possible writing of a custom ISA dumper? Thank you in advance for your answers. Cheers

cpommier commented 3 years ago

We just reviewed the code and here is the current logic and todo list:

erikkimmel commented 3 years ago

Hi, Here are the command lines I used to run my tests. This could be usefull if you want to run them on your own side. I run each command one time on the BrAPI2ISA master branch and on test/single_assay_by_level branch the second time.

on master

real 1m45.905s user 1m38.270s sys 0m0.927s

on test/single_assay_by_level

real 1m30.787s user 1m22.485s sys 0m0.878s

- bigger dataset

python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==

on master

real 276m49,662s user 267m20,711s sys 0m22,956s

on test/single_assay_by_level

real 186m11,643s user 177m29,317s sys 0m7,336s


 Regards
bedroesb commented 3 years ago

@erikkimmel which isa-tools version did you use?

erikkimmel commented 3 years ago

@bedroesb :

$ pip3 freeze | grep isatools
isatools==0.11.0
bedroesb commented 3 years ago

with isa-tools 12 i hope to finally solve #62 and to have improved isa json dumping, lets see! Do we after testing on the VIB dataset merge the test/single_assay_by_level branch into master ? or do you want to work on the other points first? @erikkimmel

cpommier commented 3 years ago

the changes proposed in test/single_assay_by_level could impove performances but have unwanted side effects. We should I think first check the impact of isatools 12, then asses the impact of this branch on metadata. It is still work in progress and there might be undesirable side effects. We can cope with it as long as there is a significative speed imporvment between master and test/single_assay_by_level. If that's not the case, the fast dumper suggested by @djcomlab might be more advisable. Thanks for your interest and help/contribution :)

erikkimmel commented 3 years ago

@bedroesb, @cpommier I started to re-run my tests with the new 0.12.0 version of isatools. I'll keep you informed.

erikkimmel commented 3 years ago

@bedroesb, @cpommier here are the results of my tests using the 0.12.0 version of isatools: I used the following command:

python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==
cpommier commented 3 years ago

Thanks Erik. isatools 0.12 is definitely an improvement. The single assay by level approach can take advantage of it, but 143m is still rather long. This is only a 3 year, 12 locations, 300 samples dataset. Extract it in a few minutes, without validation, would be really a great thing. Considering the situation, reviving the fast writer hypothesis, even if it is not that easy to write, could be the most profitable approach. What do you think @proccaserra @djcomlab ?

erikkimmel commented 3 years ago

I ran the tests on the master branch of BrAPI2ISA with the 0.13-rc2 version of isatools:

time python3.8 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==
[...]
real    8m33,605s
user    1m3,299s
sys     0m2,000s
bedroesb commented 3 years ago

this is such great news! Thanks for testing

cpommier commented 1 year ago

The one assay per level seems to be a bad approach. Performances improvement is still important, but probably with other approach, using latest isatools mainly.