Hello,
Those are all good comments! On the first point, I believe the field strength is kept to one decimal to make parameter adjustment easier. For instance, if you wanted to set the intensities of residues for only B1 = 25.0, you can just make the heading [I0, 25.0] or something like that. More decimals make the parameter matching a little more cumbersome and don't significantly affect the fit, since inhomogeneity is taken into account anyway. I could be wrong on this.
The second point. I like it!
Third point. I think everyone would like something like this, but it's perhaps tedious to get it to work nicely in a general way (formatting, colors, how to treat different Trlx values, what to do if there's missing data for one field and not another, etc.). I'm sure it can be done, but I think it could potentially be a bit of a headache.
Last point. I'd never thought about this issue. Have you run a test to see whether the same seed is used? My feeling is that different processors would produce different random numbers, but I really have no idea. It wouldn't be difficult to, say, increment the seed depending on what already exists, like we do for the 'chemex-dump' directories (see the sketch below). But that could run into its own conflicts on a large scale too, I'm sure.
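For concreteness, the "increment depending on what already exists" idea could look something like this (a hypothetical sketch; the actual directory-naming code in ChemEx may differ):

```python
import os

def next_available(base='chemex-dump'):
    """Return 'base', or 'base.N' for the smallest N not yet on disk."""
    if not os.path.exists(base):
        return base
    n = 1
    while os.path.exists('{}.{}'.format(base, n)):
        n += 1
    return '{}.{}'.format(base, n)

# The same counter could then double as a per-job offset added to the seed.
```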
Those are my thoughts, anyways.
Alex
hi Alex,
thanks for your quick answer - in response to that:
1st: using only one decimal for the field strength is totally fine and I agree that this will not affect the fitting. What I am suggesting is only to change how the B1 field is written to the output (*.dat) files. It's done in the data_point.py file(s) as follows: output.append('{b1frq:6.1e}'.format(**self.par)). So if I have, for example, 26.4 Hz it will be written as 2.6e+01 -- you actually lose the decimal value.
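To illustrate, the difference boils down to the format specification (a sketch; the exact field name and width used in data_point.py may differ):

```python
# Current behaviour: exponent notation keeps only one significant decimal.
'{b1frq:6.1e}'.format(b1frq=26.4)  # -> '2.6e+01', the .4 is lost

# Possible fix: fixed-point notation preserves the decimal.
'{b1frq:6.1f}'.format(b1frq=26.4)  # -> '  26.4'
```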
3rd: I think for the CEST data it shouldn't be too complicated unless you're going to simultaneously fit data at different temperatures, but I do see your point! I am totally happy to make my own "final" plots afterwards.
4th: the use of the "random" module in parallel processes is widely discussed on the internet; I think there are specific libraries for this if you really want it properly implemented. As far as I know you cannot get the seed value back, so it's difficult to test. My understanding from reading the documentation is that the clock is used to get a seed if you don't specify one yourself. Perhaps switching to SystemRandom might already help a bit - this uses "sources provided by the operating system", probably including the time. But it seems to me that an acceptable "poor man's version" (i.e., without resorting to special libraries) would be to allow a command-line argument to input the seed and set it just before the "for loop" in line 100 of main.py. That would make sure that independent jobs start from different positions, and for the individual MC runs within a job I think it's fair to assume the numbers will be random (enough).
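Something along these lines is what I have in mind (a rough sketch; the option name and where it hooks into main.py are placeholders, not existing chemex code):

```python
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=None,
                    help='seed for the Monte Carlo random number generator')
args = parser.parse_args()

# Seed once, just before the Monte Carlo loop: jobs launched with different
# --seed values then follow different random streams, while omitting the
# option keeps the current (time-based) behaviour.
random.seed(args.seed)
```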
Hi Renee and Alex!
Renee, thanks for all the good suggestions!
hi Guillaume and Alex,
I am interested in constraining the fitting parameters (with "lmfit") and see that work towards this is being done in the branch "feature/from_point_to_profile". I started poking around there a little to get a better understanding of the code. It seems to work pretty well and gives results consistent with earlier ChemEx versions, at least for the few simple examples I tested. Is it still work in progress, or do you consider this branch "tested & approved"?
When I started following the code from main.py, I noticed some issues in parsing.py, in particular when using the positional argument "info". As a novice with git, I fiddled around a bit and created a branch trying to fix this (not very important) issue - see https://github.com/reneeotten/chemex/tree/ro_fix_info [currently only for cest.iph.2st]
Anyway, I just wanted to let you know I am interested in contributing to the project and would be happy to discuss and/or fix stuff while going through the code-base, testing and using it, and trying to understand it. In addition, I could implement some of the suggestions in this thread as well. But before doing so, I would appreciate your input on the current status of this branch (e.g., do you have an updated local branch?) and on whether there are specific areas where you could use help.
Have a nice weekend! Renee
Hello,
Sorry for the slow response. The 'from_point_to_profile' branch is work that Guillaume is mostly doing to improve performance and, as you say, to allow constraints and relations between parameters. From talking with him, it seems it's coming along well, but I don't know how "approved" it is at the moment.
As far as contributing goes, I think we'd all be happy to have more eyes on the project! As you can imagine, we all get pretty busy with our "day jobs" and can't contribute as much as we'd hope. I'll let Guillaume chime in about any systems or conventions in the code. And I forget how becoming a contributor works - whether you need to be invited or can simply request access. If the latter, I think the answer is "go right ahead!"
hi Guillaume (and others),
thanks for pushing updates to the new branch, it looks very nice! I started going through the code a little bit and, as I said earlier, am interested in helping out to get this to a production-ready, stable version. How would you assess the current state (i.e., what parts are tested and work properly, and what still needs work)?
I am not completely sure how you want to go about discussing changes to the code and collaborating on it, but I guess pull requests are intended for this. I'll go ahead and open two pull requests soon-ish for some rather trivial things; we can see how that goes and take it from there.
Renee
Thanks Renee for your message! I'm glad you like the new version of the program and really appreciate your help making it even better. I also think that pull requests are the way to go for discussing and submitting changes to the code. I still need to familiarise myself with the process, though.
The development of the new version first aimed at improving the performance of the code, but later focused on simplifying it as well as adding new features (like constraints and relations of parameters). I believe enough has now been achieved, both in terms of code optimization and new features, to justify a release. However, as you pointed out, a fair bit of polishing is first required.
1. "chemex info": This is currently broken (as mentioned in one of your Pull Request) and I am still debating how this should be fixed. As exchange model and experiment type are now dissociated, it is not clear to me how a set of fit parameters for a specific experiment can be provided... It is probably better to use the docstring of each experiment file (like 'experiments/cest/profiles/ip.py') like a 'readme' file including a full description of the experiment and mention of what are the experimental parameters the user should provide for the fit (something like a template of the 'experiment.cfg' file, where each parameter is described). There's probably no need to implement any 'fancy' code to automatically grab these parameters from the code itself as it was the case in previous versions.
2. "shifts" experiments: Even these do work, they don't follow the new 'coding style' where 'profiles' and exchange model are dissociated. Needs a general way to compute the shift regardless of the employed model.
3. Installation: The new version is developed for Python 3 and imports 'lmfit', neither of which is widely used yet. It's therefore probably a good idea to provide easy-to-follow installation steps. I am more and more convinced that (ana)conda is the way to go for this purpose, as environments with specific versions of Python and libraries can easily be set up. I've already added the 'environment.yml' file, which allows one to create an environment with all requirements for chemex (a sketch of such a file appears below the list). It would also be nice to add chemex to the Anaconda server, so that it can be installed by just typing 'conda install chemex'. On the same note, 'chemex' should be added to PyPI (the Python Package Index). All of that should be done once a stable version is available.
4. Code review and bug hunting: This one is self-explanatory. The more eyes on the code and the more people using this version, the better. Feedback of any kind, from bug reports to suggestions on how to improve the code, is more than welcome.
5. Long term goals (not necessary for the next release...):
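Regarding point 1, the docstring idea could be as simple as something like this (a hypothetical sketch; the module path and function name are placeholders, not existing chemex API):

```python
import importlib

def print_info(experiment):
    """Print an experiment module's docstring as the 'chemex info' output,
    e.g. experiment='cest.profiles.ip'."""
    module = importlib.import_module('chemex.experiments.' + experiment)
    print(module.__doc__)
```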
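Regarding point 3, a conda environment.yml is just a short YAML file; a minimal sketch of what it might contain (the actual file in the repository is authoritative):

```yaml
name: chemex
dependencies:
  - python=3
  - lmfit
  # ...plus whatever other libraries chemex imports (numpy, matplotlib, etc.)
```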
I probably forgot a lot of things that should be done. Please do not hesitate to add anything to this list that you think is relevant.
hi Guillaume, I certainly agree with all the points you mention! It will take me a bit of time to go through the code and figure stuff out, but once I am done with that (and hopefully get some things done along the way) we can see which parts have the highest priority and how to tackle them.
Thanks for merging the PRs! I think it would be good to commit small(ish) pieces of code, write clear commit messages, and push often to the repository. Before pushing you can always do a "rebase" to combine commits/messages and keep a reasonably clean history. Of course, once it's pushed you shouldn't rewrite history anymore :)
For example, there are now 3 messages relating to the merge of a single PR, which seems a bit much and doesn't improve readability. It seems you changed a few things in my PR and then merged it, correct? I am sure there is a best practice for how to do this, but again, it's all part of the learning curve!
Hi Renee, small commits with clear descriptions are definitely the way to go. I can't agree more with you on this point. I also realize there is much room for improvement on my side ;)
Merging PR19 was kind of messy. Part of it came from the fact that I first merged PR20, which introduced a conflict with PR19 because 'chemex/chemex.py' got modified in the process. When that was fixed, my IDE (PyCharm) decided to automatically reformat all the new code, which did not play well with the merging... As you said, it's all part of the learning curve!
Regarding your previous comment about "chemex info": in the short term it's probably easier to just add all the information to the docstring of each experiment.
However, I think it would be good to do some additional coding to extract that information from the code (it seems to me that it's all present in the "create_default_params" function in the files under chemex/bases). The reason is that I think it would be helpful to introduce a positional argument "init" (or similar) that creates a default directory structure and input files; in that case we would need the same information to write, for example, the parameter and experiment files. (A rough sketch follows.)
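To make the idea concrete, such an "init" command might do something like the following (a rough sketch; all directory and file names here are placeholders, not established ChemEx conventions):

```python
import os

def init_project(directory='example'):
    """Create a default directory structure with template input files."""
    os.makedirs(os.path.join(directory, 'experiments'), exist_ok=True)
    os.makedirs(os.path.join(directory, 'parameters'), exist_ok=True)
    template = ('[experiment]\n'
                '# template generated from the defaults in the code;\n'
                '# one line per parameter, with a short description\n')
    with open(os.path.join(directory, 'experiments', 'experiment.cfg'), 'w') as out:
        out.write(template)
```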
Lots of these things are still relevant today, but I'd rather keep the issues more focused, each dealing with a single point.
hi Guillaume and co.,
I have been using ChemEx for a while now (from the "develop" branch) and it's very nice - thanks a lot for all the work! I have a few comments/suggestions. I can certainly do some of it myself, but first I wanted to check what you think and, if I'm going to add/change code, in which branch I should do so...
Thanks! Renee