adc-connect / adcc

adcc: Seamlessly connect your program to ADC
https://adc-connect.org
GNU General Public License v3.0

Adapt AdcMemory for libxm #118

Closed · mfherbst closed this 3 years ago

mfherbst commented 3 years ago

Needs the changes from https://github.com/adc-connect/libtensor/pull/9.


@maxscheurer Feel free to play with it using a nice example. Libxm seems to write everything to disk and prefetch as it goes. That means the memory requirement is very small, but also that it is only worth it if you have a fast disk and a calculation that no longer fits into your available RAM. In that sense, though, it closes a gap that adcc still has.

maxscheurer commented 3 years ago

When I tried libxm on Thursday, I had the problem that when specifying a custom pagefile dir, this directory was always deleted (probably because all allocators are being shut down once the memory pool is re-initialised?). I don't know if you had similar issues?

Other than that, I don't have time to test this over the weekend, but I will let you know early next week if everything is fine. My NVMe SSD is probably suitable for some testing, and I've got a new USB 3.1 external drive (about as fast as an internal SSD) for my Mac, so that would be cool to test đŸ˜Ŧ

mfherbst commented 3 years ago

I don't know if you had similar issues?

Not really. Everything worked the way I expected it to. The whole allocator setup in libtensor is very inflexible, so literally the only way you can expect libxm to operate sensibly is if you run this exactly once per Python process:

import adcc
adcc.memory_pool.initialise(allocator="libxm")

I suppose you did that, but just to check.
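
For completeness, a full run on top of that might look as follows (a minimal sketch; the PySCF molecule, basis and n_singlets are illustrative choices, not from this PR):

import adcc
from pyscf import gto, scf

# Switch libtensor's memory backend to libxm. As noted above, this must
# happen exactly once, before any tensor has been allocated.
adcc.memory_pool.initialise(allocator="libxm")

# Illustrative SCF reference; any adcc-supported host program works.
mol = gto.M(atom="O 0 0 0; H 0 0 0.96; H 0.93 0 -0.24", basis="cc-pvdz")
scfres = scf.RHF(mol).run()

# Everything from here on allocates through the disk-backed libxm pool.
state = adcc.adc2(scfres, n_singlets=3)
print(state.describe())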

mfherbst commented 3 years ago

My NVMe SSD is probably suitable for some testing

Yes, exactly. On those types of systems, using libxm or not should not make much of a difference, provided the problem is large enough, at least if the promises made in their publication hold true.

maxscheurer commented 3 years ago

I don't like that the specified pagefile directory is deleted after the run... I vote for creating an extra tmpdir inside pagefile_directory, which can then be deleted safely... This is already what happens when /tmp is used.
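
A minimal sketch of that idea (hypothetical helper names; the real change would live in the libtensor/libxm allocator setup, this just illustrates the directory handling in Python):

import os
import shutil
import tempfile

def make_pagefile_dir(pagefile_directory):
    # Create a throwaway subdirectory inside the user-specified pagefile
    # directory, so cleanup can remove it without touching the directory
    # the user passed in.
    os.makedirs(pagefile_directory, exist_ok=True)
    return tempfile.mkdtemp(prefix="libxm-", dir=pagefile_directory)

def cleanup_pagefile_dir(tmpdir):
    # Only the generated subdirectory is deleted on shutdown.
    shutil.rmtree(tmpdir, ignore_errors=True)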

maxscheurer commented 3 years ago

I will run a test job once the node with the fast SSD is free, so I can run the same job once with the RAM/std allocator and once with libxm.

maxscheurer commented 3 years ago

I'm a bit disappointed right now: 29.7m timer lifetime with RAM vs. 8.5h with libxm đŸ˜Ļ for the same job. I'm worried I made some mistake with the hardware, but I don't think so...

mfherbst commented 3 years ago

First thing to check: which version of libtensor did you use? Did you use the most recent one (freshly compiled and installed), or could you still have had an old one on disk which got used instead?

It could also well be that the job was too small or that you used too many threads. The libxm design only pays off if the time needed to do the contractions is roughly the same as the time needed to fetch the data from disk.
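
As a rough back-of-the-envelope illustration of that break-even condition (all bandwidth and FLOP figures below are made-up placeholders, not measurements from this job):

# Time to page a tensor batch in from disk vs. time to contract it.
batch_bytes = 8 * 10**9        # 8 GB of doubles streamed from disk
disk_bandwidth = 2 * 10**9     # 2 GB/s, an NVMe-class figure
flops_per_batch = 4 * 10**12   # contraction work done on that batch
flop_rate = 2 * 10**11         # 200 GFLOP/s sustained by the CPU threads

t_fetch = batch_bytes / disk_bandwidth      # 4.0 s
t_contract = flops_per_batch / flop_rate    # 20.0 s

# The I/O only hides behind the compute while t_fetch <= t_contract;
# adding more threads shrinks t_contract and can tip this balance.
print(f"fetch {t_fetch:.1f} s vs contract {t_contract:.1f} s")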

maxscheurer commented 3 years ago

First thing to check: which version of libtensor did you use? Did you use the most recent one (freshly compiled and installed), or could you still have had an old one on disk which got used instead?

Yeah, I used a freshly built libtensor+xm.

It could also well be that the job was too small or that you used too many threads. The libxm design only pays off if the time needed to do the contractions is roughly the same as the time needed to fetch the data from disk.

You're right: with fewer threads, the runtime of the xm job stays nearly the same as before (7.7h). I can try with a larger basis set and see what happens.

mfherbst commented 3 years ago

I can try with a larger basis set and see what happens.

Yes that would probably be good. Let me know what results you get.

maxscheurer commented 3 years ago

Needs a bibtex_bibfiles = ['ref.bib', 'pub.bib'] in docs/conf.py to build the docs with the newest Sphinx version... I've added this in #119.
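
For reference, the relevant addition (a minimal sketch of docs/conf.py; the surrounding extension list is an assumption about the file's contents):

# docs/conf.py
extensions = [
    # ... the other Sphinx extensions the docs already use ...
    'sphinxcontrib.bibtex',
]

# Mandatory since sphinxcontrib-bibtex 2.0: list the BibTeX files explicitly.
bibtex_bibfiles = ['ref.bib', 'pub.bib']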

mfherbst commented 3 years ago

Yes, I know :+1:. I had the change locally but did not push it, because I saw you fixed this elsewhere.

maxscheurer commented 3 years ago

I think we should pin libtensorlight >= 3.0.0 for conda builds because the changes here won't be backward-compatible.

mfherbst commented 3 years ago

I'll not bother adding docs for this ... it's not super useful anyway, so I'd rather not raise hopes too high.

maxscheurer commented 3 years ago

Agreed. Feel free to merge it and bump to 0.15.8 once the CI is done.

mfherbst commented 3 years ago

Any clue why the macOS tests fail?

maxscheurer commented 3 years ago

The failure when downloading the JSON file (HTTP Error 403) is more or less random; it also happens all the time in my other PR 😠

mfherbst commented 3 years ago

Hmm ... that's a bit annoying.

maxscheurer commented 3 years ago

The problem is that I cannot reproduce it locally on macOS, so it's hard to find a fix ☚ī¸ Maybe bump the number of retries to 10?
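
A sketch of what the retry bump could look like (the function and backoff policy here are hypothetical, not the actual adcc download helper):

import time
import urllib.error
import urllib.request

def fetch_with_retries(url, max_tries=10, base_delay=2.0):
    # Retry transient failures such as the HTTP 403s seen on the macOS
    # runners, waiting a little longer after each failed attempt.
    for attempt in range(1, max_tries + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError:
            if attempt == max_tries:
                raise
            time.sleep(base_delay * attempt)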

maxscheurer commented 3 years ago

The biggest problem I can see with this issue is the deployment pipeline for macOS... 👎

mfherbst commented 3 years ago

Let's see how it goes with 10 retries. If that does not help, I'm not sure what else we can do.

maxscheurer commented 3 years ago

I think one reason is that the GitHub API request is made without an access token, and it could be that GitHub blocks our requests because all the macOS VMs share the same IP đŸ˜Ļ (a result from a random web search). Maybe we could implement a fallback option to directly download a specific release version without the API call?

mfherbst commented 3 years ago

Well ... that would be quite messy. Maybe a better idea is to introduce an environment variable with the URL, which we would just set in the case of GHA?

maxscheurer commented 3 years ago

Well ... that would be quite messy. Maybe a better idea is to introduce an environment variable with the URL, which we would just set in the case of GHA?

Like LT_DOWNLOAD_LINK or something of the sort?

mfherbst commented 3 years ago

ADCC_LIBTENSOR_DOWNLOAD_URL, I'd say. It should be explicit ... because it's not as if people will use it on a regular basis.
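
A sketch of how the override could slot into the download logic (only the variable name is from this thread; the GitHub API fallback below is a guess at the existing flow):

import json
import os
import urllib.request

def libtensor_download_url():
    # Explicit override, e.g. set in the GitHub Actions workflow, to avoid
    # the unauthenticated (and rate-limited) GitHub API call entirely.
    override = os.environ.get("ADCC_LIBTENSOR_DOWNLOAD_URL")
    if override:
        return override

    # Otherwise ask the GitHub API for the latest release as before.
    api = "https://api.github.com/repos/adc-connect/libtensor/releases/latest"
    with urllib.request.urlopen(api) as resp:
        release = json.load(resp)
    return release["assets"][0]["browser_download_url"]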

maxscheurer commented 3 years ago

Yes, perfect.

maxscheurer commented 3 years ago

Thanks for making the change! If it works, we need to add the same to publish.yml I think.

mfherbst commented 3 years ago

I don't think so because it uses conda to get libtensor, no?

maxscheurer commented 3 years ago

I don't think so because it uses conda to get libtensor, no?

The conda deploy pipeline yes, but not the pip one.

mfherbst commented 3 years ago

But that uses Ubuntu.

maxscheurer commented 3 years ago

But that uses Ubuntu.

Oh yes, of course 😆

maxscheurer commented 3 years ago

Damn, now the coveralls upload fails on all pipelines đŸ¤¯

mfherbst commented 3 years ago

Coveralls had some security issues a while back; perhaps we need to update to a newer version of the GHA?

maxscheurer commented 3 years ago

We currently do a manual install of coveralls plus the upload, pulling the most recent version from pip... Maybe we need to switch to the GHA for it to work?

mfherbst commented 3 years ago

@maxscheurer see lemurheavy/coveralls-public Issue 1543 ... we can only wait.

maxscheurer commented 3 years ago

Drat... want to merge anyway?

mfherbst commented 3 years ago

Yes ok, let's do it.