andrewhancox / moodle-filter_translations


Google Translate being called multiple times for the same text #108

Closed berthelemy closed 2 years ago

berthelemy commented 2 years ago

We are getting a lot of rows in the translations table, generated by Google Translate, where rawtext = substitutetext.

In a lot of these cases, part or all of the phrase has already been translated. It looks like text is being run past the translate API more than once.

For example:

Original source language = en
target language = es
rawtext = Cultivo de maíz en Ruanda
substitutetext = Cultivo de maíz en Ruanda

This is causing our Google API credit limit to max out. NB. We now have over 11 million lines in our translations table.

berthelemy commented 2 years ago

This may be linked to #99

andrewhancox commented 2 years ago

Agreed that the two are linked - it sounds like everything is getting re-translated every time.

Can you pull the duplicate records for one use case and send them over so I can take a look at what's going on? For example:

select * from mdl_filter_translations where rawtext = 'Cultivo de maíz en Ruanda'


berthelemy commented 2 years ago

Hi Andrew,

It turns out we have 64,016 lines in the mdl_filter_translations table.

I'm not sure how I got to 11 million! I must have misread something... sorry...

Here are a few query results, which may be useful:

Query 1

select * from mdl_filter_translations where rawtext = 'Cultivo de maíz en Ruanda'

There's just one result:

| id | md5key | lastgeneratedhash | targetlanguage | contextid | rawtext | substitutetext | substitutetextformat | translationsource | usermodified | timecreated | timemodified |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 74236 | 75227eaa95b73cf2e818bde22e14ac82 | 75227eaa95b73cf2e818bde22e14ac82 | es | 1 | Cultivo de maíz en Ruanda | Cultivo de maíz en Ruanda | 1 | 20 | 0 | 1653734419 | 1653734419 |

Query 2

select * from mdl_filter_translations where substitutetext = 'Cultivo de maíz en Ruanda'

Gives two results:

| id | md5key | lastgeneratedhash | targetlanguage | contextid | rawtext | substitutetext | substitutetextformat | translationsource | usermodified | timecreated | timemodified |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 613 | 579412728bc0dc8732ff4524fec6fa8e | 579412728bc0dc8732ff4524fec6fa8e | es | 1 | Maize farming in Rwanda | Cultivo de maíz en Ruanda | 2 | 20 | 0 | 1652792600 | 1652792600 |
| 74236 | 75227eaa95b73cf2e818bde22e14ac82 | 75227eaa95b73cf2e818bde22e14ac82 | es | 1 | Cultivo de maíz en Ruanda | Cultivo de maíz en Ruanda | 1 | 20 | 0 | 1653734419 | 1653734419 |

Query 3

SELECT COUNT(id) FROM mdl_filter_translations WHERE rawtext = substitutetext

returns 53,832, i.e. 84% of the rows in the translations table.

I don't understand why there's anything here.

Query 4

SELECT COUNT(id) FROM mdl_filter_translations WHERE rawtext = substitutetext AND targetlanguage != 'en'

returns 1173 rows, which implies that the rest are translations which are being re-translated.

Let me know if you need any more data.

Mark

berthelemy commented 2 years ago

One more piece of data:

Query 5

SELECT targetlanguage, count(id) AS 'Total' FROM mdl_filter_translations WHERE rawtext = substitutetext GROUP BY targetlanguage ORDER BY Total DESC LIMIT 50

Result:

targetlanguage Total
en 52673
ar 713
bn 163
fr 151
es 35
te 21
zh_cn 21
mr 18
hi 18
ps 16
pt_br 8
rw 7
ta_lk 2

That probably fits the profile of how much the site is being used in each language.

andrewhancox commented 2 years ago

That all looks correct to me. There is an architectural issue that we cannot get around: we don't know what language the text was entered in, so we have to send it to Google Translate anyway. This means that for any given piece of text there will always be at least one instance where the source and translated values are the same.
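The constraint described above can be sketched as follows. This is a minimal illustration, not the plugin's code: `fake_translate`, `translate_cached`, `api_calls`, and the in-memory `cache` dict are hypothetical stand-ins, and keying the cache on an MD5 of the raw text is an assumption suggested by the `md5key` column.

```python
import hashlib

# Hypothetical stand-in for the Google Translate call; records each invocation.
api_calls = []

def fake_translate(text, target):
    api_calls.append((text, target))
    known = {("Maize farming in Rwanda", "es"): "Cultivo de maíz en Ruanda"}
    # Text that is already in the target language comes back unchanged.
    return known.get((text, target), text)

# Assumed cache layout: (md5 of raw text, target language) -> substitute text.
cache = {}

def translate_cached(text, target):
    key = (hashlib.md5(text.encode("utf-8")).hexdigest(), target)
    if key not in cache:
        # The source language is unknown, so the text must be sent at least once,
        # even if it turns out to be in the target language already.
        cache[key] = fake_translate(text, target)
    return cache[key]

first = translate_cached("Cultivo de maíz en Ruanda", "es")
second = translate_cached("Cultivo de maíz en Ruanda", "es")
```

Under these assumptions the Spanish text is sent to the API once and stored with identical source and substitute values; the second call is served from the cache. A row where rawtext = substitutetext is therefore expected on its own, and only repeated rows for the same text and language would point to duplicate calls.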

To check for duplicated calls to Google Translate, run this:

select count(id), rawtext from mdl_filter_translations where rawtext = substitutetext group by rawtext, targetlanguage having count(id) > 1;
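To see what this query does and does not flag, here is a small sketch that runs it against an in-memory SQLite table. The rows are made-up toy data, the table has only the columns the query needs, and SQLite stands in for whatever database the Moodle site actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mdl_filter_translations ("
    "id INTEGER PRIMARY KEY, targetlanguage TEXT, rawtext TEXT, substitutetext TEXT)"
)
# Toy rows for illustration only; not taken from the real table.
conn.executemany(
    "INSERT INTO mdl_filter_translations VALUES (?, ?, ?, ?)",
    [
        (1, "en", "General Forum", "General Forum"),   # same text stored twice
        (2, "en", "General Forum", "General Forum"),   # for the same language
        (3, "es", "Cultivo de maíz en Ruanda", "Cultivo de maíz en Ruanda"),
        (4, "es", "Maize farming in Rwanda", "Cultivo de maíz en Ruanda"),
        (5, "en", "Further reading", "Further reading"),
    ],
)

rows = conn.execute(
    "select count(id), rawtext from mdl_filter_translations "
    "where rawtext = substitutetext "
    "group by rawtext, targetlanguage having count(id) > 1"
).fetchall()
print(rows)
```

Only the text stored twice for the same target language is reported. Rows 3 and 5, where the raw and substitute text match but appear once, are not counted as duplicates.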

berthelemy commented 2 years ago

The query below returned 1014 rows.

select count(id), rawtext from mdl_filter_translations where rawtext = substitutetext group by rawtext, targetlanguage having count(id) > 1;

There are 5 text chunks repeated 5 times.

10 text chunks are repeated 4 times.

The rest are either repeated 3 or 2 times.
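A breakdown like the one above can be produced directly by wrapping the duplicate check in an outer query. The sketch below assumes that approach and uses made-up toy rows in an in-memory SQLite table; the `repeats` and `dup` aliases are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mdl_filter_translations ("
    "id INTEGER PRIMARY KEY, targetlanguage TEXT, rawtext TEXT, substitutetext TEXT)"
)
# Toy rows for illustration only: A appears 3 times, B and C twice, D once.
toy = [
    ("en", "Text A"), ("en", "Text A"), ("en", "Text A"),
    ("en", "Text B"), ("en", "Text B"),
    ("es", "Text C"), ("es", "Text C"),
    ("en", "Text D"),
]
conn.executemany(
    "INSERT INTO mdl_filter_translations (targetlanguage, rawtext, substitutetext) "
    "VALUES (?, ?, ?)",
    [(lang, text, text) for lang, text in toy],
)

# Inner query: one row per duplicated (rawtext, targetlanguage) pair.
# Outer query: how many pairs share each repeat count.
distribution = conn.execute(
    "SELECT repeats, COUNT(*) FROM ("
    "  SELECT COUNT(id) AS repeats FROM mdl_filter_translations"
    "  WHERE rawtext = substitutetext"
    "  GROUP BY rawtext, targetlanguage HAVING COUNT(id) > 1"
    ") AS dup GROUP BY repeats ORDER BY repeats DESC"
).fetchall()
print(distribution)
```

Each result tuple is (repeat count, number of text chunks with that count), so the toy data yields one chunk repeated 3 times and two chunks repeated twice.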

Mostly the text seems to be coming from profile fields.

However, there are quite a few rows like these below, where the text is definitely coming from course content:

Count Text
2 Image credits
2 Live session Cropping Strategies
2 Pre-course Assessment
2 Economic threshold
2 Economic threshold
2 Digital learning team 2022
2 Unit 2: Global View on IPM
2 Unit 11: The Big Five in Plant Protection
2 Unit 12: Prevention: An Overview
2 Unit 13: Prevention: Soil & Substrate Preparation
2 Unit 14: Prevention: Cropping Systems
2 Unit 15: Prevention: Clean Planting Material
2 Unit 16: Prevention: Crop Resistance & Tolerance
2 Unit 17: Prevention: Physical Measures
2 Unit 18: Prevention: Pest Transmissions
2 Unit 19: Prevention: Natural & Conservation Biocontrol
2 Prevention: Proceed with your Green & Yellow list
2 Unit 20: Monitoring & Decisions: An Overview
2 Unit 21: Monitoring & Decisions: Risks Assessment
2 Unit 22: Monitoring & Decisions: Population Dynamics
2 Unit 23: Monitoring & Decisions: Sampling & Recording
2 Unit 24: Monitoring & Decisions: Forecasting & Warning
2 Unit 25: Monitoring & Decisions: Evaluation & Decisions
2 Unit 26: Monitoring & Decisions: Case Study Oil Seed Rape
2 Monitoring & Decisions: Proceed with Your Green & Yellow List
2 Unit 28: Green: Physical Control
2 Unit 29: Green: Augmentation Biocontrol
2 Unit 30: Green: Technological Control
2 Unit 31: Green Direct Control: A Wrap Up
2 Green direct control: Proceed with your Green & Yellow list
2 Unit 32: Yellow Direct Control with Restrictions: An Overview
2 Unit 34: Yellow: Basic Substances
2 Unit 35: Yellow: Pesticides Mode of Activity
2 Unit 36: Yellow: Pesticides Mode of Action
2 Unit 37: Red: Banned Practices
2 Unit 38: Yellow: Risks to Users
2 Unit 39: Yellow: Risks to Consumers
2 Unit 40: Yellow: Risks to Environment
2 Unit 41: Yellow: Rational Pesticide Use Summary
2 Yellow Direct Control: Proceed with your Green & Yellow list
2 Unit 1: Introduction to Crop Fertilization in ICM
2 Unit 2: Nutrients and Soil Fertility
2 Further reading
2 Unit 4: Secondary Macronutrients
2 Unit 6: Nutrient Deficiencies
2 Unit 7: Soil Testing for Nutrients
2 Unit 8: Soil Organic Matter
2 Unit 9: Mineral Fertilizers and their Application
2 Unit 10: Organic Manure and its Application
2 Unit 11: Crop Fertilization - Calculating Nutrients
2 Unit 12: Case Study: Fertilization in Apple Production
2 Unit 14: Link Between Fertilization and the Environment
2 General Forum
2 Stage 2: Nursery plants
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Soil Insect Pest Management in Open-field Leafy Vegetables
2 Pest Situation 1 Few soil insect pests observed during soil preparation. Select a pest management option:
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 1: Pre-planting - Pest Situation 1
2 Pest Situation 1 Few soil insect pests observed during soil preparation. Select a pest management option:
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Continue
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 1: Pre-planting - Pest Situation 2
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 1: Pre-planting - Conclusion
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 2: Planting/Sowing - Pest Situation 1
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 2: Planting/Sowing - Pest Situation 2
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 3: Planting/Sowing - Conclusion
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 3, 4 & 5: Vegetative Growth to Fruiting and Harvesting - Pest Situation 1
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 3, 4 & 5: Fruiting and Harvesting - Pest Situation 2
2 Crop Pest Management Module 6: Insects & Mites: Scenario 4: Soil Insect Pest Management in Open-field Leafy Vegetables: Stage 3, 4 & 5: Vegetative Growth to Fruiting and Harvesting - Conclusion
2 Lesson 4: Exceptions to the rules
2 Module 4: Nutrient deficiencies
2 Unit 5: CSPM; Production and Resources

andrewhancox commented 2 years ago

As discussed, this is not believed to be an issue.