Split on molecules and not conformations in dataloader

choderalab / modelforge

Infrastructure to implement and train NNPs

https://modelforge.readthedocs.io/en/latest/

MIT License

11 stars 4 forks source link

Split on molecules and not conformations in dataloader #39

Closed wiederm closed 11 months ago

wiederm commented 11 months ago

As mentioned in the tile, we are currently splitting on conformations and should change this to a splitting on molecules

ArnNag commented 11 months ago

I think I removed the record index @chrisiacovella was referring to since it was redundant with dataset["n_confs"], which is the count of conformers in each record. So in order to get the split indices, the splitting strategy's split method would have to shuffle an index with the length of dataset["n_confs"] , split these indices, then convert this to atomic_subsystem_indices_referencing_dataset (as in https://github.com/choderalab/modelforge/blob/48281ac742d476d7e9371c59e9c7b586205c8dce/modelforge/dataset/dataset.py#L569).

wiederm commented 11 months ago

Any progress here? Is there anything I can help with?

ArnNag commented 11 months ago

~#41~ #42 should resolve this.

chrisiacovella commented 11 months ago

It seems to make sense to remove the record_index array (that was a bit of a kludge to work with the prior structure of the reader).

I put together some example datafiles that basically just duplicate some of the data in the qm9 dataset (just duplicating some of the underlying arrays), that should be useful for testing (we can add these in as additional tests for the qm9 dataset, maybe change from a single bool "unit_testing" to some sort of specification of type of test we want).

10 records total, 24 total conformers: https://drive.google.com/file/d/1xdGoqEkWxmxysJIDiienX4TOeZccCT9s/view?usp=sharing 100 records, total, 200 total conformers: https://drive.google.com/file/d/1Ror0KnopG0KR4EvV8yagWq0G5MsU66Md/view?usp=sharing

wiederm commented 11 months ago

This has been implemented in PR #42