Fixes #78 Reading comprehension synthetic data regex improvements

Enhance regexes to post process LLM output for reading comprehension dataset creation.

Improves pubmed 500 dataset generation from yielding only 7% of the dataset to yielding 93% of the dataset:

Total texts missed: 466 out of 501
Size of the train set: 33. Size of the validation set: 2

to:

Total texts missed: 36 out of 501
Size of the train set: 441. Size of the validation set: 24

Changes in this PR:

Improve regex extraction as described above
Add a new optional parameter to the reading comprehension pipeline to save the "unprocessed output" - eg, the raw output from the LLM before trying to extract Q&A. This is helpful when analyzing errors in the Q&A extraction.
Add try/except handling when calling tokenizer.apply_chat_template(). This blew up during development on a few inputs, and it caused the entire data pipeline to fail because of a few bad inputs with None as the content. After this fix it will log an exception with a stacktrace, but will continue with the rest of the dataset and provide a much better user experience. Also note that it should never need to kick in as long as the upstream data generation is generating correct inputs.

Validating on the Nuclear patent dataset

I used a tiny subset of the nuclear patent dataset to validate that it also works on that dataset:

nuclear_patents_25_arcee_format.csv

and it yields 85% of the dataset:

12/04/2023 20:37:02 - INFO - dalm.pipelines.reading_comprehension_pipeline -  Statistics
12/04/2023 20:37:02 - INFO - dalm.pipelines.reading_comprehension_pipeline - Total number of successfully extracted q&a: 24
12/04/2023 20:37:02 - INFO - dalm.pipelines.reading_comprehension_pipeline - Total texts missed: 4 out of 28
12/04/2023 20:37:02 - INFO - dalm.pipelines.reading_comprehension_pipeline - Total texts processed: 28

Here are some example extracted reading comprehension texts from the nuclear patent dataset.

Manual e2e testing

Here are some generated chat completions on the pubmed500 dataset.

Example generated doc 1

``` [ { "content": "Based on the following text: \n Advances in neuroradiological planning techniques in deep brain stimulation have put the need for intraoperative electrophysiological monitoring into doubt. Moreover intraoperative monitoring prolongs surgical time and there is potential association between the use of microelectrodes and increased incidence of hemorrhagic complications. The aim of this study was to analyze the correlation between the anatomically planned trajectory and the final subthalamic electrode placement after electrophysiological monitoring in patients with Parkinson\"s disease and its change with the increasing experience of the surgical team.\", \n I'd like you to answer a few questions\n", "role": "user" }, { "content": "what is the aim of the study mentioned in the text, and why is this analysis important?", "role": "user" }, { "content": "the aim of the study is to analyze the correlation between the anatomically planned trajectory and the final subthalamic electrode placement after electrophysiological monitoring in patients with parkinson's disease, and this analysis is important because advances in neuroradiological planning techniques in deep brain stimulation have put the need for intraoperative electrophysiological monitoring into doubt, and the study aims to determine if the anatomically planned trajectory is sufficient without the need for intraoperative monitoring.", "role": "assistant" }, { "content": "what are the potential drawbacks of intraoperative monitoring during deep brain stimulation surgery?", "role": "user" }, { "content": "intraoperative monitoring prolongs surgical time, and there is potential association between the use of microelectrodes and increased incidence of hemorrhagic complications.", "role": "assistant" }, { "content": "can you provide an example of a deep brain stimulation planning technique that has put the need for intraoperative electrophysiological monitoring into doubt?", "role": "user" }, { "content": "the text mentions that advances in neuroradiological planning techniques in deep brain stimulation have put the need for intraoperative electrophysiological monitoring into doubt. however, the text does not provide a specific example of such a technique.", "role": "assistant" }, { "content": "how does the study mentioned in the text address the potential association between the use of microelectrodes and increased incidence of hemorrhagic complications during deep brain stimulation surgery?", "role": "user" }, { "content": "the study does not directly address the potential association between the use of microelectrodes and increased incidence of hemorrhagic complications during deep brain stimulation surgery. the text only mentions that there is potential association, but it does not provide any details on how the study addresses this issue.", "role": "assistant" } ] ```

Example generated doc 2

This one has a null value - which should probably be fixed. ``` [ { "content": "Based on the following text: \n Deep brain stimulation (DBS) therapy relies on both precise neurosurgical targeting and systematic optimization of stimulation settings to achieve beneficial clinical outcomes. One recent advance to improve targeting is the development of DBS arrays (DBSAs) with electrodes segmented both along and around the DBS lead. However, increasing the number of independent electrodes creates the logistical challenge of optimizing stimulation parameters efficiently., \n I'd like you to answer a few questions\n", "role": "user" }, { "content": "the statement \"dbs therapy does not rely on precise neurosurgical targeting and systematic optimization of stimulation settings\" is:", "role": "user" }, { "content": null, "role": "assistant" }, { "content": "deep brain stimulation (dbs) therapy aims to achieve _____ clinical outcomes by relying on _____ neurosurgical targeting and _____ optimization of stimulation settings.", "role": "user" }, { "content": "beneficial; precise; systematic", "role": "assistant" }, { "content": "create a sentence using the keywords 'dbs arrays', 'electrodes', 'along', and 'around'.", "role": "user" }, { "content": "dbs arrays with electrodes segmented both along and around the dbs lead", "role": "assistant" } ] ```

Example generated doc 3

``` [ { "content": "Based on the following text: \n Deep learning has received significant attention recently as a promising solution to many problems in the area of artificial intelligence. Among several deep learning architectures, convolutional neural networks (CNNs) demonstrate superior performance when compared to other machine learning methods in the applications of object detection and recognition. We use a CNN for image enhancement and the detection of driving lanes on motorways. In general, the process of lane detection consists of edge extraction and line detection. A CNN can be used to enhance the input images before lane detection by excluding noise and obstacles that are irrelevant to the edge detection result. However, training conventional CNNs requires considerable computation and a big dataset. Therefore, we suggest a new learning algorithm for CNNs using an extreme learning machine (ELM). The ELM is a fast learning method used to calculate network weights between output and hidden layers in a single iteration and thus, can dramatically reduce learning time while producing accurate results with minimal training data. A conventional ELM can be applied to networks with a single hidden layer; as such, we propose a stacked ELM architecture in the CNN framework. Further, we modify the backpropagation algorithm to find the targets of hidden layers and effectively learn network weights while maintaining performance. Experimental results confirm that the proposed method is effective in reducing learning time and improving performance., \n I'd like you to answer a few questions\n", "role": "user" }, { "content": "1 (complete-the-sentence): deep learning has received significant attention recently as a promising solution to many problems in the area of artificial intelligence. complete the following sentence: deep learning has shown _____ performance in the applications of object detection and recognition when compared to other machine learning methods.", "role": "user" }, { "content": "superior", "role": "assistant" }, { "content": "2 (true/false): a cnn can be used for image enhancement before lane detection by excluding noise and obstacles that are irrelevant to the edge detection result. is this statement true?", "role": "user" }, { "content": "yes, the statement is true.", "role": "assistant" }, { "content": "3 (domain-keywords): using the keywords 'cnn', 'object detection', and'recognition', create a sentence that describes the superior performance of deep learning in these applications when compared to other machine learning methods.", "role": "user" }, { "content": "deep learning, specifically convolutional neural networks (cnns), demonstrate superior performance in the applications of object detection and recognition when compared to other machine learning methods.", "role": "assistant" }, { "content": "4 (normal q&a): what is the proposed learning algorithm for cnns in the text, and how does it reduce learning time while maintaining performance?", "role": "user" }, { "content": "the proposed learning algorithm for cnns in the text is an extreme learning machine (elm) architecture. this method can dramatically reduce learning time while producing accurate results with minimal training data by calculating network weights between output and hidden layers in a single iteration. the proposed method also modifies the backpropagation algorithm to find the targets of hidden layers and effectively learn network weights while maintaining performance.", "role": "assistant" } ] ```

Example generated doc 4

This one isn't even generating questions .. it looks meaningless. ``` [ { "content": "Based on the following text: \n Machine learning-based approaches now outperform competing methods in most disciplines relevant to diagnostic radiology. Image-guided procedures, however, have not yet benefited substantially from the advent of deep learning, in particular because images for procedural guidance are not archived and thus unavailable for learning, and even if they were available, annotations would be a severe challenge due to the vast amounts of data. In silico simulation of X-ray images from 3D CT is an interesting alternative to using true clinical radiographs since labeling is comparably easy and potentially readily available., \n I'd like you to answer a few questions\n", "role": "user" }, { "content": "yes, this statement means that techniques using machine learning algorithms are currently surpassing other approaches in the majority of fields related to medical imaging analysis for diagnostic purposes.", "role": "user" }, { "content": "the text explains that in image-guided procedures, images are not saved for learning purposes and annotations would be extremely challenging due to the high volume of data involved.", "role": "assistant" }, { "content": "the text suggests that creating simulated x-ray images from ct scans is a possible alternative to using genuine medical images since labeling is less demanding and potentially more readily available.", "role": "user" }, { "content": "the text highlights that annotating images for machine learning in the context of image-guided procedures is a significant challenge because of the vast amounts of data involved. this means that accurately labeling each image would be a time-consuming and laborious process, making it a severe challenge.", "role": "assistant" } ] ```

Unit tests

See tests/datasets/reading_comprehension_generation/test_utils.py

arcee-ai / DALM