The new torch-crf implementation does not currently support storing CRF features on disk. This option would be beneficial for users with a limited memory budget.
To implement this successfully, we would mainly have to re-implement the scikit-learn train_test_split function used in pytorch_crf.py. This sounds like a good idea to me for two main reasons:
First, obviously, it would let us support storing CRF features on disk.
Second, the function currently does not allow stratifying when some labels have only one instance. This forces us to take on the additional overhead of duplicating such unique examples whenever the stratify option is passed. A custom train_test_split could perhaps bypass this restriction, as I'm not convinced it is truly essential for achieving the intended effect of stratification.
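To illustrate the idea, here is a minimal sketch of what a custom stratified split could look like. The function name and signature are hypothetical, not part of the existing codebase; the point is that singleton-label examples can simply be routed to the training set rather than duplicated or rejected, which is how scikit-learn's train_test_split behaves today.

```python
import random
from collections import defaultdict

def stratified_train_test_split(examples, labels, test_size=0.2, seed=0):
    """Hypothetical sketch: a stratified split that tolerates labels with a
    single instance by keeping them in the training set, instead of raising
    an error or requiring the caller to duplicate them."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_label.items():
        if len(indices) == 1:
            # Singleton label: no duplication needed, keep it in train.
            train_idx.extend(indices)
            continue
        rng.shuffle(indices)
        n_test = max(1, int(round(len(indices) * test_size)))
        # Never let the test slice consume the whole class.
        n_test = min(n_test, len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [examples[i] for i in train_idx]
    test = [examples[i] for i in test_idx]
    return train, test
```

This keeps the per-label proportions roughly intact for classes with two or more instances, while degrading gracefully for the singleton case that currently forces the duplication workaround.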
It makes sense to do this in a separate PR: there are already a lot of moving parts in the current PR, this would be better evaluated as a standalone change, and we would also need to implement an efficient file-seeking mechanism for a file-backed CRF feature.