Closed wiederm closed 11 months ago
I think I removed the record index @chrisiacovella was referring to since it was redundant with dataset["n_confs"]
, which is the count of conformers in each record. So in order to get the split indices, the splitting strategy's split method would have to shuffle an index with the length of dataset["n_confs"]
, split these indices, then convert this to atomic_subsystem_indices_referencing_dataset
(as in https://github.com/choderalab/modelforge/blob/48281ac742d476d7e9371c59e9c7b586205c8dce/modelforge/dataset/dataset.py#L569).
Any progress here? Is there anything I can help with?
~#41~ #42 should resolve this.
It seems to make sense to remove the record_index array (that was a bit of a kludge to work with the prior structure of the reader).
I put together some example datafiles that basically just duplicate some of the data in the qm9 dataset (just duplicating some of the underlying arrays), that should be useful for testing (we can add these in as additional tests for the qm9 dataset, maybe change from a single bool "unit_testing" to some sort of specification of type of test we want).
10 records total, 24 total conformers: https://drive.google.com/file/d/1xdGoqEkWxmxysJIDiienX4TOeZccCT9s/view?usp=sharing 100 records, total, 200 total conformers: https://drive.google.com/file/d/1Ror0KnopG0KR4EvV8yagWq0G5MsU66Md/view?usp=sharing
This has been implemented in PR #42
As mentioned in the tile, we are currently splitting on conformations and should change this to a splitting on molecules