The rationale behind this PR is as follows:

- Currently, all files are saved in the same data folder (`data`).
  - Different parts of the tutorial can use files with the same name, but with different expectations for those files.
    - E.g. Part 3.1 saves a simple vocab with 3-dimensional vectors, while Part 4.2 expects a dimensionality of 300.
  - Different parts may also change the files, and these changes may not be compatible with the expectations of other parts.
- When running parts of the tutorial multiple times, the data always needs to be redownloaded.
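
The name clash described above can be reproduced in miniature. The following is only a sketch: it uses a plain pickled dict rather than MedCAT's actual vocab format, and the file name `vocab.dat`, the helper `load_vectors` and the temporary folder are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the single shared data folder (assumption: a temp dir).
shared_dir = tempfile.mkdtemp()
vocab_path = os.path.join(shared_dir, "vocab.dat")  # hypothetical file name

# "Part 3.1": writes a toy vocab with 3-dimensional vectors.
with open(vocab_path, "wb") as f:
    pickle.dump({"house": [0.1, 0.2, 0.3]}, f)


def load_vectors(path, expected_dim):
    """Hypothetical loader that rejects vectors of the wrong dimensionality."""
    with open(path, "rb") as f:
        vectors = pickle.load(f)
    for word, vec in vectors.items():
        if len(vec) != expected_dim:
            raise ValueError(
                f"{word!r} has {len(vec)} dimensions, expected {expected_dim}"
            )
    return vectors


# "Part 4.2": reads the same path but expects 300-dimensional vectors.
try:
    load_vectors(vocab_path, expected_dim=300)
    mismatch = None
except ValueError as err:
    mismatch = str(err)

print(mismatch)
```

Because both parts resolve the same path, the second part silently picks up whatever the first part left behind, unless the file is downloaded again in between.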
The "problem" I've described above is, of course, a byproduct of my trying to run the tutorials multiple times with the download steps commented out (they take a long time for me, at least when working from home).
If I weren't doing that, there would be no incompatibility.
So my current "solution" is to create a separate folder for each part of the tutorial (e.g. `data_p2`) and, on top of that, to download files only if the remote copy is newer than the local one (the `-N` flag).
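
The idea can be sketched in Python as follows. This is not the code in the PR, which relies on `wget -N` inside the notebooks; it is a simplified analogue that only compares timestamps. The helper names `part_data_dir` and `should_download`, the file name `example.dat` and the temporary root folder are assumptions made for illustration.

```python
import os
import tempfile


def part_data_dir(part: str, root: str = ".") -> str:
    """Return (and create if needed) a per-part data folder such as data_p2."""
    path = os.path.join(root, f"data_p{part}")
    os.makedirs(path, exist_ok=True)
    return path


def should_download(local_path: str, remote_mtime: float) -> bool:
    """Simplified timestamp check in the spirit of wget -N.

    Download only if there is no local copy, or if the remote file
    (modification time given in seconds since the epoch) is newer.
    """
    if not os.path.exists(local_path):
        return True
    return os.path.getmtime(local_path) < remote_mtime


# Usage with a temporary root so nothing is written to the working directory.
root = tempfile.mkdtemp()
folder = part_data_dir("2", root)
target = os.path.join(folder, "example.dat")  # hypothetical file name

first = should_download(target, remote_mtime=0.0)   # no local copy yet
open(target, "w").close()                           # pretend we downloaded it
second = should_download(target, remote_mtime=0.0)  # local copy is newer
```

Keeping one folder per part means a file written by one part can never be read by another by accident, at the cost of storing shared files more than once.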
Overall, this approach has a number of pros and cons, which I have tried to list here as best I can.
| Pros | Cons |
| --- | --- |
| Downloads are not repeated if the file exists locally | Extra disk space is required on the user's side (in my case, around 6 GB in total) |
| Files are not shared between different parts of the tutorial | If files are changed locally, they will not automatically be redownloaded |
The current approach was to apply the `-N` flag to all `wget` downloads. If there are files for which this is not appropriate, those changes would need to be reverted.
PS:
The problem arose, as I mentioned above, when I was trying to run the tutorials multiple times to see how they fared with the current master branch of MedCAT. Because I had already downloaded the files in prior runs, I commented the downloads out. This led to Part 4.2 reading the file written by Part 3.1, which caused the incompatibility.
PPS:
This may not be the ideal solution to the "problem" at hand, and I'm more than open to other ideas and/or the dismissal of this PR.