alpha testers? - Githubissues

adswa commented 5 years ago

@mih, I have an alpha version of multimatch now! There is a bit of documentation here, the package is installable via pip install multimatch (and for the little it does, it seems to be working), there are some tests. At this point, I'm suffering a bit from funnel vision. There's probably a ton to be improved or user-unfriendly, and also, I haven't shown the code to anyone yet. Do you know anyone who might be a naive alpha-tester for this to point me towards bugs, deficiencies in the documentation, general improvements (to then do improvements in code with improvements in user-friendliness in mind), or do you have any other suggestion on how to proceed? I've also created a backbone for a submission-attempt at JOSS - once this repo is in a 'good-enough' state, a submission looks like a thing of half and hour.

mih commented 5 years ago

I'll take a look later this week.

adswa commented 5 years ago

(No rush)

mih commented 5 years ago

With the fix in #8 I could install it from github in a VirtualEnv (does not have pandoc by default). Congrats on putting this together!!

Few comments/suggestions (with highly a subjective touch, focusing on the "what-would-I-have-done-differently aspect" (not saying "better" as I have yet to grasp much of what is being done, before I can afford an opinion).

I prefer long option names without underscores: --direction_threshold -> --direction-threshold
I would not have anything study-specific in a tool repo (i.e. no studyforrest). I haven't looked closely yet, but the bridge seems to be a "converter" that turns remodnav event segmentations into the fixation sequences required by the rest of the code.
description of datafile content doesn't mention unit:

tab separated file with columns corresponding to x-coordinates, y-coordinates, and fixation duration in seconds

coordinates in pixels?
related: I am surprised that there is --screensize, but no way to specify the viewing distance (did not look into the paper yet, hence I don't know if needed, but I would expect threshold etc. to be in visual degrees. I looks like here things need to be all in pixels. Not a big problem, but if I want to run to analyses on the, say, lab-sample of studyforrest vs the MRI-sample, and I want to how the same setup, I would have to use two-different sets of parameters, because sizes and distances differ.
it would be good if code comments shed some light on the intention of what is being done and/or the data structures. Example:
```
    # initialize empty lists
    sim_dur = []
    sim_x = []
    sim_y = []
    sim_lenx = []
    sim_leny = []
    sim_theta = []
    sim_len = []
```
The present comment is not really needed. it is fairly obvious that this is the case. What would be intersting here (and elsewhere) is to know why. There are some non-trivial item-access operations (i.e. in simlen() with two indices that go front and back, where I have little intuition what is supposed to happen. A comment up top would help.
related: the army of lists in simlen() smells like a suboptimal data structure. It seems, together, they contain properties of events, and theyy all have to be manipulated in-sync (which is rather error prone). I'd suggest to think about turning the data structure in a single list of events, where each item is a dictionary with the relevant properties (given by a human-readable name). A similar change might be useful for data, which is now accessed by eight elements of a "magic" first axis and them by event on the second axis. It would make sense to handle that vice versa too: first axis events, and then properties by name (names are already there, but only in the docstring of gen_scanpath_structure -- it would be generally useful to use them instead of numerical indices for readability)

Is that helpful?

adswa commented 5 years ago

Few comments/suggestions (with highly a subjective touch, focusing on the "what-would-I-have-done-differently aspect" (not saying "better" as I have yet to grasp much of what is being done, before I can afford an opinion).

Subjective suggestions and comments are the best - they give me a chance to at one point also have subjective opinions instead of no opinion at all.

I prefer long option names without underscores: --direction_threshold -> --direction-threshold

cool, will change that.

I would not have anything study-specific in a tool repo (i.e. no studyforrest). I haven't looked closely yet, but the bridge seems to be a "converter" that turns remodnav event segmentations into the fixation sequences required by the rest of the code.

Yep, thats correct. I'll strip it down to only multimatch.

description of datafile content doesn't mention unit: tab separated file with columns corresponding to x-coordinates, y-coordinates, and fixation duration in seconds coordinates in pixels?

Good that you noticed that, will add the information

related: I am surprised that there is --screensize, but no way to specify the viewing distance (did not look into the paper yet, hence I don't know if needed, but I would expect threshold etc. to be in visual degrees. I looks like here things need to be all in pixels. Not a big problem, but if I want to run to analyses on the, say, lab-sample of studyforrest vs the MRI-sample, and I want to how the same setup, I would have to use two-different sets of parameters, because sizes and distances differ.

Now that you say it, I'm also confused why this isn't anywhere. The code is at least mirroring the matlab implementation. To be entirely sure I checked the Matlab code and the corresponding publication, and there indeed is no viewing distance mentioned. Not even in the validation experiments (Dewhurst, 2012) method description is a viewing distance described anywhere.

Participants and apparatus A group of 20 participants (9 females, 11 males; 26.9±5.3 years old) generated scanpaths by looking at stimuli in an eyetracker. All participant had normal or corrected-to-normal vision. In addition, ideal scanpaths were synthesized by assuming that a single par- ticipant perfectly performed the task of looking at points in the given order. The stimuli were displayed on a Samsung Syncmaster 931c TFT LCD 19-in. (380 × 380 mm) screen running at 60 Hz, with a resolution of 1,280 × 1,024 pixels. The stimuli were presented using MATLAB R2009b and the Psycho- physics Toolbox (Brainard, 1997). Binocular eye movements were recorded at 500 Hz with the SMI HiSpeed system and iView X 2.5, using default settings.

I'm not sure, but doesn't this make any comparison between data obtained with different viewing distances problematic? How would a viewing distance be possibly incorporated into a screensize measure? Not that this would be a deficiency of this reimplementation, rather a deficiency of the method itself.

it would be good if code comments shed some light on the intention of what is being done and/or the data structures. [...]

will do!

related: the army of lists in simlen() smells like a suboptimal data structure. It seems, together, they contain properties of events, and theyy all have to be manipulated in-sync (which is rather error prone). I'd suggest to think about turning the data structure in a single list of events, where each item is a dictionary with the relevant properties (given by a human-readable name). A similar change might be useful for data, which is now accessed by eight elements of a "magic" first axis and them by event on the second axis. It would make sense to handle that vice versa too: first axis events, and then properties by name (names are already there, but only in the docstring of gen_scanpath_structure -- it would be generally useful to use them instead of numerical indices for readability)

haha, yes, that makes sense. I think is very visible in the code that I started writing this code with zero knowledge of anything and then sticked to the very first way it worked for the first time without thinking/caring about much more than "phew, thank god it works". I will change that to the proposed data structure and make sure I index more readable.

Is that helpful?

It is super helpful, thank you so much!

adswa commented 5 years ago

as an update:

multimatch_forrest has no entrypoint anymore now. The Docs are multimatch_forrest-free.
The suboptimal nested list structure is an OrderedDict now - THANKS for this advice. everything is so much neater now.
cmd line arguments have dashes instead of underscores now
I wasn't able to advance the simplification functions simlen() and simdir() yet, apart from the fact that they now in- and output an OrderedDict as a datastructure and use a small helper function for a reduction in repetition.

adswa / multimatch_gaze

alpha testers? #7