It isn’t clear whether the extracted text comes from AC itself or from an external service, i.e. the SIBiLS fetch API or the OCR API. For example, PMC2365968_supplementary/Processed/1752-1947-2-112-S3.tiff_bioc.json is clearly wrong because there is no text in the image, and nothing in the bioc.json indicates where the text came from. I guess the OCR in this case?
I propose we add a “textsource” parameter to the BioC output file, only when we process images, with its value being the URL of the API service that supplied the text. This parameter goes at the same level (in the documents array) as “inputfile”.
For example, if the text comes from the fetch API (addition highlighted):
"documents": [
  {
    "id": 1,
    "inputfile": "PMC10021083_supplementary/Raw/sj-jpg-1-tah-10.1177_20406207231155991.jpg",
    "textsource": "https://sibils.text-analytics.ch/api/fetch?ids=PMC10021083_sj-jpg-1-tah-10.1177_20406207231155991.jpg&col=suppdata",
    "infons": {},
    "passages": [
      { ….
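To make the proposal concrete, here is a minimal Python sketch of how the field could be written. It assumes the BioC output has already been loaded as a plain dict with a "documents" list; the helper name `add_textsource` is hypothetical and not part of the existing code. It inserts "textsource" directly after "inputfile" in each document, so the key lands at the level described above.

```python
import json


def add_textsource(bioc: dict, source_url: str) -> dict:
    """Return a copy of a BioC collection with "textsource" set on every document.

    Hypothetical helper: the name and signature are illustrative only.
    """
    documents = []
    for doc in bioc.get("documents", []):
        updated = {}
        for key, value in doc.items():
            if key == "textsource":
                continue  # drop any stale value; it is re-added below
            updated[key] = value
            if key == "inputfile":
                # Place the new key right after "inputfile".
                updated["textsource"] = source_url
        # Fall back to appending when the document has no "inputfile" key.
        updated.setdefault("textsource", source_url)
        documents.append(updated)
    return {**bioc, "documents": documents}


# Usage with the fetch API example from this proposal.
collection = {
    "documents": [
        {
            "id": 1,
            "inputfile": "PMC10021083_supplementary/Raw/sj-jpg-1-tah-10.1177_20406207231155991.jpg",
            "infons": {},
            "passages": [],
        }
    ]
}
fetch_url = (
    "https://sibils.text-analytics.ch/api/fetch"
    "?ids=PMC10021083_sj-jpg-1-tah-10.1177_20406207231155991.jpg&col=suppdata"
)
print(json.dumps(add_textsource(collection, fetch_url), indent=2))
```

The sketch builds new document dicts rather than mutating the input, so the original collection can still be compared against the annotated one when debugging.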
For example, if the text comes from the OCR API (addition highlighted):