Use caption to suggest depictions

commons-app / apps-android-commons

The Wikimedia Commons Android app allows users to upload pictures from their Android phone/tablet to Wikimedia Commons

https://commons-app.github.io/

Apache License 2.0

Use caption to suggest depictions #5422

Open nicolas-raoul opened 10 months ago

nicolas-raoul commented 10 months ago

If I caption my picture "Papilio machaon on Asteraceae flower in Croatia", then the app should ideally suggest these depictions:

papilio machaon
asteraceae
flower
Croatia

First step: Parse the caption into expressions. This could be delegated to an LLM if one is available locally on the device. For now that seems to mean only the Pixel 8 Pro (the Samsung Galaxy S24 will follow), and we would need to get approved to use the API: https://ai.google.dev/tutorials/android_aicore On devices where no such technology is available, the app can fall back to more traditional tokenization, maybe even splitting on each space character.

Second step: Search for 1 or 2 Wikidata items that match each expression found in the first step.

Third step: Add these suggestions to the list of all suggestions.
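The fallback path of the first step and the lookup of the second step can be sketched as follows. This is only an illustration written in Python for brevity, not the app's code (the app itself would need a Java or Kotlin equivalent). The stopword list and both function names are hypothetical. The lookup uses the Wikidata `wbsearchentities` API action, and the sketch only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Hypothetical stopword list; a real one would need to cover many languages.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "at", "and"}

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def split_caption(caption: str) -> list[str]:
    """Fallback for step one: split on whitespace and drop stopwords."""
    words = [w.strip(".,;:!?\"'()") for w in caption.split()]
    return [w for w in words if w and w.lower() not in STOPWORDS]


def wikidata_search_url(expression: str, language: str = "en", limit: int = 2) -> str:
    """Step two: build a wbsearchentities request for one expression."""
    params = {
        "action": "wbsearchentities",
        "search": expression,
        "language": language,
        "limit": limit,  # "1 or 2 Wikidata items" per expression
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


caption = "Papilio machaon on Asteraceae flower in Croatia"
expressions = split_caption(caption)
print(expressions)
# ['Papilio', 'machaon', 'Asteraceae', 'flower', 'Croatia']
for expression in expressions:
    print(wikidata_search_url(expression))
```

Note the limitation of plain space splitting: "Papilio machaon" ends up as two separate expressions, which is the kind of case where an LLM or a smarter tokenizer would help.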

rohit9625 commented 10 months ago

I don't think splitting the caption would work, because how do we identify which words need to be shown in depictions?

nicolas-raoul commented 10 months ago

@rohit9625 It is pretty easy using an LLM:

Even splitting word-by-word would work reasonably well.

Prompt for further improvements:

You are a syntax parser that understands all human languages. For each input sentence, extract all nouns from the sentence. Include each adjective with its noun. Do not output anything but the list of nouns.

Example: "Papilio machaon on Asteraceae flower in Croatia"
 - papilio machaon
 - asteraceae
 - flower
 - Croatia

Now do the same for: "Oxcart passing on a wooden bridge in Papua New Guinea"
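A minimal sketch of how the prompt above could be filled in for an arbitrary caption, and how a bullet-list reply could be turned back into expressions. The function names are hypothetical, the call to the on-device model is deliberately left out, and the reply used in the usage lines is simply the example list from the prompt itself rather than real model output.

```python
PROMPT_TEMPLATE = (
    "You are a syntax parser that understands all human languages. "
    "For each input sentence, extract all nouns from the sentence. "
    "Include each adjective with its noun. "
    "Do not output anything but the list of nouns.\n\n"
    'Example: "Papilio machaon on Asteraceae flower in Croatia"\n'
    " - papilio machaon\n"
    " - asteraceae\n"
    " - flower\n"
    " - Croatia\n\n"
    'Now do the same for: "{caption}"'
)


def build_prompt(caption: str) -> str:
    """Insert the user's caption into the prompt template."""
    return PROMPT_TEMPLATE.format(caption=caption)


def parse_expressions(reply: str) -> list[str]:
    """Keep only the lines that look like ' - item' and return the items."""
    expressions = []
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("-"):
            item = line[1:].strip()
            if item:
                expressions.append(item)
    return expressions


prompt = build_prompt("Oxcart passing on a wooden bridge in Papua New Guinea")

# Reply in the format the prompt asks for (copied from the prompt's own example).
reply = " - papilio machaon\n - asteraceae\n - flower\n - Croatia"
print(parse_expressions(reply))
# ['papilio machaon', 'asteraceae', 'flower', 'Croatia']
```

Ignoring every line that does not start with a dash is a cheap guard in case the model adds text despite the instruction not to.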

RitikaPahwa4444 commented 10 months ago

> That means only on Pixel 8 Pro for now it seems

Is on-device execution necessary for this use case? The API is available for target API level 21 and higher, and should work if we're not going for Gemini Nano.

nicolas-raoul commented 10 months ago

> Is on-device execution necessary for this use case?

On-device execution is necessary. As per our privacy policy, we do not send any HTTP request to any server outside of Wikimedia (Mapbox was the exception until recently, but hopefully this should be resolved soon).

Details: https://github.com/commons-app/commons-app-documentation/blob/master/android/Privacy-policy.md#privacy-policy

RitikaPahwa4444 commented 10 months ago

Thank you for clarifying! Since captions are publicly available, I'd not thought of it from this perspective. This seems like an on-device LLM, but I'm still exploring😅.

nicolas-raoul commented 10 months ago

Great link, thanks Ritika! I wonder how many megabytes that would add to the app... If not too many, that's definitely an option.

kanahia1 commented 8 months ago

Hey @nicolas-raoul, I have found a model which we could use for named-entity recognition (NER): https://huggingface.co/dslim/bert-base-NER. They provide a TensorFlow model which we can convert to a .tflite model and then integrate into the app.

I tested it with "Papilio machaon on Asteraceae flower in Croatia"; these were the results:

[
  {
    "entity_group": "PER",
    "score": 0.4467669427394867,
    "word": "Pa",
    "start": 0,
    "end": 2
  },
  {
    "entity_group": "PER",
    "score": 0.8728312849998474,
    "word": "mac",
    "start": 8,
    "end": 11
  },
  {
    "entity_group": "MISC",
    "score": 0.9185755848884583,
    "word": "Asteraceae",
    "start": 19,
    "end": 29
  },
  {
    "entity_group": "LOC",
    "score": 0.9997634291648865,
    "word": "Croatia",
    "start": 40,
    "end": 47
  }
]

I believe the accuracy is good, as it gives us Asteraceae and Croatia.
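One possible way to keep only the usable entities from the output above is to discard spans that start or end in the middle of a word (such as "Pa" and "mac") and, optionally, spans below a score threshold. This is a sketch only: the function names and the 0.5 threshold are assumptions, and the scores are shortened from the output above.

```python
import json

CAPTION = "Papilio machaon on Asteraceae flower in Croatia"

# The model output shown above, with scores shortened.
NER_OUTPUT = """[
  {"entity_group": "PER",  "score": 0.4468, "word": "Pa",         "start": 0,  "end": 2},
  {"entity_group": "PER",  "score": 0.8728, "word": "mac",        "start": 8,  "end": 11},
  {"entity_group": "MISC", "score": 0.9186, "word": "Asteraceae", "start": 19, "end": 29},
  {"entity_group": "LOC",  "score": 0.9998, "word": "Croatia",    "start": 40, "end": 47}
]"""


def is_whole_word(caption: str, start: int, end: int) -> bool:
    """True if the span [start, end) does not cut through a word."""
    before_ok = start == 0 or not caption[start - 1].isalnum()
    after_ok = end == len(caption) or not caption[end].isalnum()
    return before_ok and after_ok


def usable_entities(caption: str, raw_json: str, min_score: float = 0.5) -> list[str]:
    """Keep whole-word entities whose score reaches min_score (assumed value)."""
    return [
        entity["word"]
        for entity in json.loads(raw_json)
        if entity["score"] >= min_score
        and is_whole_word(caption, entity["start"], entity["end"])
    ]


print(usable_entities(CAPTION, NER_OUTPUT))
# ['Asteraceae', 'Croatia']
```

With this filter the fragments are dropped, but "Papilio machaon" and "flower" are still missing, since the model did not report them as entities at all.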

kanahia1 commented 8 months ago

@nicolas-raoul, I have converted the model to .tflite. It is about 411 MB, which is very large for use in an Android app (it would increase the app size accordingly).

Would it be possible to use MLKit by Google in the app? https://github.com/googlesamples/mlkit/tree/master/android/entityextraction

nicolas-raoul commented 8 months ago

@kanahia1 Great research, thanks!

I feel like 50 MB should be the maximum...

MLKit sounds great! Can it extract keywords? How big is it? 🙂

kanahia1 commented 8 months ago

@nicolas-raoul Sure, it can extract entities: https://github.com/googlesamples/mlkit/tree/master/android/entityextraction?rgh-link-date=2024-03-04T06%3A00%3A08Z Entity extraction has an app size impact of up to ~5.6 MB.

They provide examples:

Input Text - Meet me at 1600 Amphitheatre Parkway, Mountain View, CA, 94043 Let’s organize a meeting to discuss.
Output Text - 1600 Amphitheatre Parkway, Mountain View, CA 94043

More examples are available here

kanahia1 commented 8 months ago

@nicolas-raoul I tested MLKit, and the results were not satisfactory. I think it would be better to drop the idea of using MLKit until Google makes it more effective, since it is currently in beta.

Test 1 
Input: Dust storm approaching Stratford, Texas
Output: There are no entities detected

Test 2
Input: Aerial view of the city of Mayfield on December 12.
Output: December 12.

Test 3
Input: Interior view of the main nave of the Cathedral in Narbonne, France. The Roman Catholic church is dedicated to Saints Justus and Pastor and it begun its construction in 1272. The choir was finished in 1332, but the rest of the gothic building was never completed, as the result of many factors including sudden changes in the economic status of Narbonne, its unusual size and geographical location (to complete it would have meant demolishing the city wall) and financial constraints.
Output: There are no entities detected

Test 4
Input: Mt Technical from Lewis tops, Lewis Pass Scenic Reserve, New Zealand
Output: There are no entities detected

If I get enough time by the end of this month, I will try to train a custom model 🙂