Export data should have more descriptive filename

tymees commented 2 years ago

Currently it only contains the internal number of the datapoint. This was done in order to avoid having to look at the data itself to generate a filename.

However, it has been reported that this isn't as useful to the researchers that analyse the data, as there is no relation between filename and data. Thus, we should try to add a more meaninfull identifier to the filename.

The main problem we need to solve is the fact that we cannot 100% rely on the supplied data to follow the same conventions. As such, we need to a) be fault tolerant and b) provide a fallback.

tymees commented 2 years ago

My suggestion would be to look for a column named 'subject', all our templates (should) add this field. Maybe even make it configurable during experiment registration in the app?

As for the two requirements: it should not be hard to wrap everything in a try/except block; as for the fallback, my suggestion would be to use the old method as the fallback.

bbonf commented 2 years ago

We could possibly add a subject id to the new session API, and use that in the filename when it's available

tymees commented 2 years ago

Hm, that could work. However, I'd also add it to the existing upload endpoints as an optional parameter. That way you have two workflows:

If not using the session API, one can specify the subject ID at upload time (ideally as an optional parameter)
If using the session API, you'd (optionally) specify the subject ID at session creation time.

Another way to solve our problem might be to add an optional 'filename' parameter to the upload API; that way someone can upload multiple sets of data and name them in a way the researcher knows what the different files are.

tymees commented 2 years ago

We've decided to do the following:

Make it mandatory to supply a subject ID when setting up a session
Add an optional parameter to the non-session upload method to supply a subject ID.

To do this, we need to do the following:

For the non-session method:

Add a 'subject_id' field to the DataPoint model, which will store that piece of metadata
Add parameter to the regular upload API endpoint to supply the ID and update the view to save that ID on te Datapoint
- It's probably the easiest to provide the id through a path variable, like access_key is already
- As it's optional, this should be done with a new url path. (So that we have two upload URL's, one with the path variable and one without)
- To use the new path variable, one adds a new method parameter to UploadView.post with a default value (I would use None). The default value is needed as otherwise Django will error when using the URL path without the new variable

For the session method:

Right now the datastore already assigns a participant ID automatically (in UUID format). Thus we have multiple options to aproach this:

Add a new subject_id parameter alongside the participant_id.
Rewrite the UiL utils (and experiments using the utils) to use the subject ID generated by the datastore instead of the JSPsych generated one. (It's already sent by the session API, so it should be somewhat easy to do this)
Rewrite the datastore to use a subject ID sent through the API instead
- This would require the participant ID field to be changed to a regular text field
- In addition, the utils need to be rewritten to sent the locally generated ID when creating the session, but that should be about it (as the code saves the subject ID and adds it to the upload call already)

Option 1 is not ideal imo, as that would result in two fields that are similarly named and fulfilling similar goals.

As for option 2 vs 3, it's more of a question on who should be responsible for determining the subject ID. You could make the case that it's better to have the datastore be authorative, as it knows more about the other participants. (And thus could better avoid collisions). On the other hand, allowing the experiment to determine it would allow for more flexibility in the format of the subject ID.

(Another point against 2: UUID's are great, but also horrible to parse as humans. So it might not be as userfriendly as a shorter ID).

I'll leave it up to the implementer to decide.

In either case the code for sessions doesn't require that much updating, as it already uses subject ID's. The only thing to add is a way to access the ID from the datapoint. This, again, has multiple options to accomplish this:

Use the DataPoint.session and read the subject ID through that link
- It would probably best to create a property or helper method on DataPoint to check where the subject ID is stored (on the DP itself, or through an attached session) and retrieve it. This way, other code that wants to use this ID can just call that instead of having to figure out where to retrieve it from
When saving the DataPoint, fill in the subject_id field like the non-session method.

Option 2 is by far the easiest, as one can just modify BaseUploadView._save_data_point to accept an optional parameter. (And from a complexity perspective it would probably be the best). However, it would result in saving data in multiple places, so it's maybe not the best option in that regard.

(There's a hidden option 3: create a session record for non-sessioned uploads as well and store the subject ID there in both cases. I don't really like that option, as it would add/save more data than necessary in some cases. But I'm not against it if one decides to go for that as well).

For both methods

To actually resolve the core issue, both download functions (experiments/utils/download.py) need to be updated to use the new field to add the ID to the filename. This shouldn't be too hard, but keep in mind to check if there actually is a subject ID to add (as it's technically still optional). I would suggest just adding the subject-ID to the current filename format.

tymees commented 2 years ago

Well that took way too long and might be overly descriptive. Please tell me if anything is unclear

maartenuni commented 2 years ago

The descriptions above are pretty clear. However, I'm missing a bit of the point or am a little confused - possibly both. Should the research participants fill out their name, our use another id? As it is in the jsPsych boiler plates do something like:

let subject_id = jsPsych.randomization.randomID(8);

But when we substitute the current number with the number as generated by jsPsych above, than we replace one meaningless number by another. So coming back to the original question, what kind of meaning full identifier are we talking about?

tymees commented 2 years ago

The main point, as I gather, is that there's no relation between the contents of the file and the filename. Quote from the mail:

de csv-databestanden zijn genummerd gewoon met een volgnummer, dat blijkt in de praktijk ook behoorlijk onhandig (want dat nummer is in het bestand zelf nergens terug te vinden, dus als je een bepaalde pp weg wilt gooien ben je lang bezig om het juiste bestand te vinden).

But when we substitute the current number with the number as generated by jsPsych above, than we replace one meaningless number by another. So coming back to the original question, what kind of meaning full identifier are we talking about?

Yes, I see your point. We are definitly replacing one meaningless string with another; in my opinion however, it's still a improvement in two ways: 1) at least the meaningless string is the same everywhere and 2) it would make it possible for the experiment to decide how meaningless it should be (to the point that the experiment gets the freedom to just use a name or something. (Well, the last one would be a benefit if we implement it that way).

That does remind me; we probably should be encoding the ID on the jspsych-utils using urlencode if we're going to use path variables

maartenuni commented 2 years ago

Ok, one more question. In theory the javascript above that generates the subject_id, could generate the same number twice. It's unlikely, but... Do we want to something to prevent that to happen. If we lump the original internal number and the subject_id in the final name, we end up with different names, is that good enough? If we would always use the session api we could enable a counter, that does ParticipantSession.count += 1 up for each created ParicipantSession and use that as an id. I think every Model has a default pk member right, which does something like this right? I'll try to work out both methods anyway as it is a good exercise.

tymees commented 2 years ago

It could happen, yes... I think your proposal is a great idea.

If we would always use the session api we could enable a counter, that does ParticipantSession.count += 1 up for each created ParicipantSession and use that as an id. I think every Model has a default pk member right, which does something like this right?

The DataPoint model already has a counter, so can find an example of that in the code already. It's implemented through a simple PositiveInteger field on the model, and a signal to set it. (See experiments/signal.py). A signal is a calback function which can be set to triggered upon certain actions; in this case it's called just before a DataPoint is saved. This advantage of using a signal to set the counter's value is not having to remember setting it everywhere one creates a DataPoint.

Alternatively, one could just use the default pk value. It's way easier, as it already exists. The only disadvantage I can think of is the fact pk is a global instance counter, so one could get Session 1, 52, 88 instead of Session 1, 2, 3 in an experiment. If you're only appending the pk for uniqueness sake, I think it would be fine. For actually displaying the number by itself, not so much.

maartenuni commented 2 years ago

Shall I close this issue?

tymees commented 2 years ago

Yes!

maartenuni commented 2 years ago

This feature hast been merged inhttps://github.com/UiL-OTS-labs/web-experiment-datastore/commit/0869a39b0e7c4163670d5c3667f23ba8fe30cfbb

UiL-OTS-labs / web-experiment-datastore