Querent-ai / querent-python-research

Querent
https://querent.xyz

Added ingested images #285

Closed Ansh5461 closed 6 months ago

Ansh5461 commented 7 months ago

Algorithm for Ingested Images -

Variant based on occurrence count:

  1. First we search the OCR text for any of the fixed entities. If we find a binary pair, it goes straight through.
  2. If we find only one entity in the OCR text, we check the page text. Among the binary pairs that contain the fixed entity found in the OCR text, we pick the one that occurs most often. (Send all the pairs tied for the highest count.)
  3. If we don't find any fixed entity in the OCR text, we check the page text. We pick the binary pair that occurs most often.

Variant based on confidence score:

  1. First we search the OCR text for any binary entity pair. If we find a binary pair, it goes straight through.
  2. If we find only one entity in the OCR text, we check the page text. Among the binary pairs that contain the fixed entity found in the OCR text, we pick the one with the highest confidence score. (Send all the pairs tied for the highest confidence.)
  3. If we don't find any entities in the OCR text, we check the page text. We pick the binary pair that has the highest confidence score. (Attach all the entities to the image.)
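
The three steps above can be sketched roughly as follows. This is only an illustration, not the code in the repository: `entities_in`, `with_counts`, `select_pairs`, and the `(subject, object, score)` tuple shape are hypothetical, and `score` stands for either the occurrence count or the confidence score depending on the variant.

```python
from collections import Counter
from itertools import combinations


def entities_in(text, entities):
    """Return the entities that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [e for e in entities if e.lower() in lowered]


def with_counts(pairs):
    """Turn (subject, object) pairs into (subject, object, count) tuples."""
    return [(s, o, n) for (s, o), n in Counter(pairs).items()]


def select_pairs(ocr_text, page_pairs, entities):
    """Pick the entity pairs to attach to an image.

    page_pairs is a list of (subject, object, score) tuples taken from the
    page text; this shape is an assumption made for the sketch.
    """
    found = entities_in(ocr_text, entities)
    # Step 1: a pair found in the OCR text goes straight through.
    if len(found) >= 2:
        return list(combinations(found, 2))
    # Step 2: one OCR entity, so keep only the page pairs that contain it.
    # Step 3: no OCR entity, so every page pair is a candidate.
    candidates = page_pairs
    if len(found) == 1:
        anchor = found[0].lower()
        candidates = [p for p in page_pairs if anchor in (p[0].lower(), p[1].lower())]
    if not candidates:
        return []
    # Send every pair tied for the best score, without duplicates.
    best = max(score for _, _, score in candidates)
    return list(dict.fromkeys((s, o) for s, o, score in candidates if score == best))
```

For the occurrence-count variant, `with_counts` can be applied to the raw page pairs first, so that the same `select_pairs` function serves both variants.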

Triple event payload:

  1. Subject
  2. Subject Type
  3. Object
  4. Object Type
  5. Predicate: "has_image"
  6. Predicate Type: "has_image"
  7. Sentence: A textual representation or description of the relationship.
  8. unique_id: Hash of base64 encoding of the image
  9. doc_source
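
A minimal sketch of this payload as a Python dataclass, with `unique_id` derived from the base64 encoding of the image. The class name `ImageTriple`, the helper `image_unique_id`, and the choice of SHA-256 are assumptions; the proposal only says "hash".

```python
import base64
import hashlib
from dataclasses import asdict, dataclass


def image_unique_id(image_bytes: bytes) -> str:
    """Hash the base64 encoding of the image (SHA-256 is an assumption)."""
    encoded = base64.b64encode(image_bytes)
    return hashlib.sha256(encoded).hexdigest()


@dataclass
class ImageTriple:
    """Hypothetical container mirroring the nine fields listed above."""
    subject: str
    subject_type: str
    object: str
    object_type: str
    sentence: str
    unique_id: str
    doc_source: str
    predicate: str = "has_image"
    predicate_type: str = "has_image"


triple = ImageTriple(
    subject="asphaltene",
    subject_type="method",
    object="nitrogen",
    object_type="method",
    sentence="asphaltene precipitation and deposition during nitrogen gas cyclic injection",
    unique_id=image_unique_id(b"raw image bytes go here"),
    doc_source="example.pdf",  # placeholder file name
)
payload = asdict(triple)  # plain dict, ready to serialize
```

Hashing the base64 text rather than the raw bytes follows the wording of field 8; either choice gives the same id for the same image as long as it is applied consistently.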

Vector event payload:

  1. ID: A unique identifier for the event, which is Subject_Predicate_Object.
  2. Embeddings: A list of floating-point numbers representing the vector in high-dimensional space.
  3. Size: The dimensionality of the vector.
  4. Namespace: has image
  5. Sentence: A textual representation or description of the vector.
  6. unique_id
  7. blob
  8. doc_source
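
The vector payload could be assembled along these lines. `ImageVector` and `build_vector` are hypothetical names, and treating `blob` as a base64 string of the image is an assumption, since the proposal does not specify its encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageVector:
    """Hypothetical container mirroring the eight fields listed above."""
    id: str
    embeddings: List[float]
    size: int
    sentence: str
    unique_id: str
    blob: str  # assumed here to be the base64-encoded image
    doc_source: str
    namespace: str = "has image"


def build_vector(subject, obj, embeddings, sentence, unique_id, blob, doc_source,
                 predicate="has_image"):
    """Build the vector payload; the ID is Subject_Predicate_Object."""
    vector = list(embeddings)
    return ImageVector(
        id=f"{subject}_{predicate}_{obj}",
        embeddings=vector,
        size=len(vector),  # dimensionality of the vector
        sentence=sentence,
        unique_id=unique_id,
        blob=blob,
        doc_source=doc_source,
    )
```

Deriving `size` from the embedding list keeps the two fields from drifting apart.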

@saraswatpuneet sir, this is the payload we are thinking of for each event. Can you verify that this is what is expected on the quester side too?

saraswatpuneet commented 6 months ago

@Ansh5461 let's try to bring this to a working state when you are back this week; meanwhile I will implement the blob storage part on the quester side :rocket:

ngupta10 commented 6 months ago

Logic

1. Fixed Entities Provided

  fixed_entities = ["geologists", "Earth", "asphaltene", "Eagle ford", "Nitrogen", "industry"],
  sample_entities = ["method", "method", "method", "method", "method"]

- If entity pair(s) are found in image/ocr text, we release all the triples as events.

What-is-Geology

triple: {'subject': 'geologists', 'subject_type': 'method', 'object': 'earth', 'object_type': 'method', 'predicate': 'has image', 'predicate_type': 'has_image', 'sentence': "geology is the scientific study of the earth's structure, composition, and processes that shape its surface. geologists investigate the formation of rocks, minerals, and the earth's interior through methods like fieldwork, laboratory analysis, and technology. this discipline encompasses diverse fields such as paleontology, seismology, and geochemistry. "}

- If only one entity is found in image/ocr text, we form binary entity pairs containing this entity from the page text and release all the triples as events.

The file below contains an image, and we set the fixed entities to ['Asphaltene', 'Nitrogen']. Only one entity, 'Asphaltene', is found in ocr_text, so the entity pairs are built using the page text.

Untitled 1 (2).pdf

triple: {'subject': 'asphaltene', 'subject_type': 'method', 'object': 'nitrogen', 'object_type': 'method', 'predicate': 'has image', 'predicate_type': 'has_image', 'sentence': 'asphaltene precipitation and deposition during nitrogen gas cyclic\nmiscible and immiscible injection in eagle ford shale and its impact\non oil recovery\nmukhtar elturki and abdulmohsin imqam*\n cite even though similar technologies have been used in\nunconventional reservoirs with some success stories in shale resources, cyclic gas\ninjection enhanced oil recovery (eor) is still a little-understood subject in boosting\noil recovery from unconventional reservoirs.'}

- If no entity is found in image/ocr text, we find binary entity pairs in the page text and release all the triples as events.

Untitled 1.pdf

triple: {'subject': 'eagle ford', 'subject_type': 'method', 'object': 'industry', 'object_type': 'unknown', 'predicate': 'has image', 'predicate_type': 'has_image', 'sentence': 'eagle ford shale has been of\nimportance in the oil and gas\nindustry with\nthe new advent\nof unconventional technology\nin recent years. previous\nstudies have shown\nthat eagle ford\nshale is a world-class source\nrock.'}

2. No Fixed Entities Provided

- If entity pair(s) are found in image/ocr text, we release all the triples as events.

What-is-Geology

triple: {'subject': 'minerals', 'subject_type': 'b-geopetro', 'object': 'earth', 'object_type': 'b-geopetro', 'predicate': 'has image', 'predicate_type': 'has_image', 'sentence': "geology is the scientific study of the earth's structure, composition, and processes that shape its surface. geologists investigate the formation of rocks, minerals, and the earth's interior through methods like fieldwork, laboratory analysis, and technology. this discipline encompasses diverse fields such as paleontology, seismology, and geochemistry. "}

- If only one entity is found in image/ocr text, we form binary entity pairs containing this entity from the page text and release all the triples as events.

Untitled 2.pdf

triple: {'subject': 'precipitation', 'subject_type': 'b-geopetro', 'object': 'deposition', 'object_type': 'b-geopetro', 'predicate': 'has image', 'predicate_type': 'has_image', 'sentence': 'asphaltene precipitation and deposition in earth during nitrogen gas cyclic\nmiscible and immiscible injection in eagle ford shale and its impact\non oil recovery\n'}

- If no entity is found in image/ocr text, we find binary entity pairs in the page text and release all the triples as events.

Untitled 1 (2).pdf

triple: {'subject': 'deposition', 'subject_type': 'b-geopetro', 'object': 'hydrocarbon', 'object_type': 'b-geopetro', 'predicate': 'has image', 'predicate_type': 'has_image', 'sentence': 'unconventional resources, like\nshale reservoirs, are widely recognized for their extremely low\npermeability and porosity.1 despite the fact that multistage\nhydraulic fracturing and horizontal well drilling techniques are\nused to extract the remaining oil from such reservoirs, only 4-\n6% of the trapped oil can be extracted, and the oil production\ndrops after a few months, attributing to the ultralow\npermeability.2-19 water injection is also one of the suitable\nstrategies for increasing oil recovery from conventional\nreservoirs; nevertheless, due to weak injectivity, insufficient\nsweep potency, and clay swelling concerns, this approach is not\nthe ideal solution for tight reservoirs.20,21 cyclic gas injection\noutperforms gas flooding methods in terms of enhancing oil\nrecovery, mainly in ultratight reservoirs.22,23 the total organic\ncarbon (toc) is the most important influencing parameter on\ngas injection in tight reservoirs because kerogen makes the\nsurface of the pore oil-wet, making the oil inside challenging to\nextract.24 due to the combination of multiphase fluids (i.e.,\ngas, oil, condensate, and water) and scales, multiphase flow\nproduction can create a number of challenges including wax\nand asphaltene deposition, hydrate formation, slugging, and\nemulsions.25 organic hydrocarbon particles settling in oil and\ngas reservoirs might create many flow assurance problems\nthroughout the extraction process. 
these materials may\nincrease flow resistance, causing production reduction or\neven pipeline plugging.26,27 crude oil is a complicated\ncomposition of hydrocarbons with different molecular weights\nreceived:\njuly 29, 2022\nrevised:\nseptember 27, 2022\narticle\npubs.acs.org/ef\n(c) xxxx american chemical society\na\nhttps://doi.org/10.1021/acs.energyfuels.2c02533\nenergy fuels xxxx, xxx, xxx-xxx\ndownloaded via missouri univ science & technology on october 11, 2022 at 15:40:56 (utc).\n'}
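
As a rough illustration of the three branches described in this comment, the sketch below builds triples in the same dict shape as the outputs shown. It is not the code from the PR: `find_entities` and `image_triples` are hypothetical, entity detection is reduced to substring matching, the fallback to `'unknown'` for an entity without a matching type is a guess based on the `'industry'` example, and which text ends up in `sentence` is also an assumption.

```python
from itertools import combinations


def find_entities(text, entities):
    """Return the entities (already lowercase) that occur in text."""
    lowered = text.lower()
    return [e for e in entities if e in lowered]


def image_triples(ocr_text, page_text, fixed_entities, sample_entities):
    """Release every triple for an image, following the three cases above."""
    # Map each entity to its type; entities without a type get 'unknown'.
    types = {
        e.lower(): (sample_entities[i] if i < len(sample_entities) else "unknown")
        for i, e in enumerate(fixed_entities)
    }
    in_ocr = find_entities(ocr_text, types)
    in_page = find_entities(page_text, types)
    if len(in_ocr) >= 2:
        # Entity pair(s) found in the OCR text.
        pairs, sentence = list(combinations(in_ocr, 2)), ocr_text
    elif len(in_ocr) == 1:
        # One OCR entity: pair it with entities from the page text.
        pairs = [(in_ocr[0], e) for e in in_page if e != in_ocr[0]]
        sentence = page_text
    else:
        # No OCR entity: use pairs from the page text only.
        pairs, sentence = list(combinations(in_page, 2)), page_text
    return [
        {
            "subject": s,
            "subject_type": types[s],
            "object": o,
            "object_type": types[o],
            "predicate": "has image",
            "predicate_type": "has_image",
            "sentence": sentence.lower(),
        }
        for s, o in pairs
    ]
```

When no fixed entities are provided, the same flow would apply with the entity list coming from the model's own entity recognition instead of `fixed_entities`.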

Ansh5461 commented 6 months ago

Screenshot from 2024-05-04 16-43-53