petteripitkanen opened this issue 5 years ago
@petteripitkanen I'll take this one.
@petteripitkanen I'm about halfway there. I do have one question: I'm going to dump the data in a database, and I'm debating between mongodb and postgres. Any thoughts? One plus for postgres is that I can run SQL queries against it. I also don't see an issue with devising a schema.
The main output format should be a plain JS file, the one that is currently used. I don't see the benefit of dumping the data to a database, but having multiple output formats seems okay to me.
@petteripitkanen I have the tool ready for use. Note it only does JSON but I'm now adding support for other formats. I just wanted you to get a first look. Any and all feedback welcome. I tried it and the file size is 23M (current one is 17M). I spot checked a few plants and it has all uses. My repo is here: https://github.com/l0n3star/scraper
Thanks, I have tried it, and it is looking good. These are my comments for the moment; they are mostly about practical problems that I found, as I haven't yet gone deeply into the code. I think eventually it would be good to integrate this into the powerplant code base. Also, when the static crop data is updated it is necessary to produce some sort of diff to see easily that there are no regressions (hopefully `git diff` is clear enough when the crops are sorted to the same order).

- If `processPlantContent` fails for a crop then this crop is not included in the output. There can easily be connection problems during a run, so a retry mechanism is needed (see the sketch after this list).
- It looks like some crops don't include the property `binomial`, for example Rosmarinus officinalis (https://practicalplants.org/w/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=json&titles=Rosmarinus%20officinalis), but then `title` seems to contain the binomial name. There could be logic for this: if `binomial` is not included, fill it with data from `title`?
- The property `functions` is always a string while it should be an array of objects that have the `function` property.
- It looks like `forage` should also be an array; currently it is a string, and sometimes it contains `}}` at the end.
- It would be good to double-check that other known array properties are parsed correctly.
- It would be good to double-check whether there are more properties that are missing data when compared to the raw MediaWiki crop data; maybe there are other properties that we have partly missed before?
- It'd be clearer to have one async function that fetches the crops and returns an array of JS objects (also sketched below); this way the output step could be separated from the fetching step, and it would be easier to add support for different output formats.
- It'd be okay to use a MediaWiki parser library if there is one available that seems suitable, though the parsing logic doesn't seem to be that complex, so I think it is fine to go with fixing the hand-written parser as well.
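A rough sketch of the retry idea combined with a single fetch function; `fetchCrop` is a hypothetical stand-in for the tool's per-crop fetch-and-parse function, which may be named differently:

```js
// Rough sketch of a retry wrapper so that a transient connection error
// doesn't silently drop a crop from the output. `fetchCrop` is a
// placeholder for the tool's function that fetches and parses one crop.
async function fetchWithRetry(title, retries = 3, delayMs = 1000) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fetchCrop(title);
    } catch (error) {
      if (attempt >= retries) throw error;
      console.warn(`Retrying ${title} (attempt ${attempt}): ${error.message}`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// One async function that fetches all crops and returns an array of plain
// JS objects, keeping the fetching step separate from the output step.
async function fetchAllCrops(titles) {
  const crops = [];
  for (const title of titles) {
    crops.push(await fetchWithRetry(title));
  }
  return crops;
}
```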
Thanks for the great feedback!
I did some related debugging in #97; it seems that the raw (practicalplants.org) MediaWiki data has two types of properties: strings and arrays of objects, though sometimes the strings are actually CSV-encoded arrays.
It looks preferable that the parser understands only these two types (raw strings, arrays of objects), and that the conversion from CSV to array is done in another step (a rough sketch of that step follows the list below).
Currently all properties that are arrays of objects are incomplete, so for all these properties there is potentially data missing:

- edible part and use
- medicinal part and use
- material part and use
- toxic parts
- functions
- shelter
- forage
- crops
- subspecies
- cultivar groups
- ungrouped cultivars
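A rough sketch of that separate conversion step; the property list here is illustrative, not the complete set of CSV-encoded properties:

```js
// Separate CSV-to-array conversion step: the parser keeps raw strings,
// and this later pass splits the properties known to be CSV-encoded.
// CSV_PROPERTIES is illustrative, not the complete list.
const CSV_PROPERTIES = ['soil texture', 'soil ph', 'grow from', 'edible parts', 'edible uses'];

function splitCsvProperties(crop) {
  const converted = { ...crop };
  for (const property of CSV_PROPERTIES) {
    if (typeof converted[property] === 'string') {
      converted[property] = converted[property]
        .split(',')
        .map((value) => value.trim())
        .filter((value) => value.length > 0);
    }
  }
  return converted;
}
```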
Thank you. I will take a look.
I have taken a look at available JS parsers, for instance infobox-parser and wtf_wikipedia (this one is Wikipedia-specific, but it is possible to trick the parser by changing the template name `Plant` to `Infobox element`), and it seems that none of these handle nested templates properly.
Tackling the general case even for a single template expression seems to require a full wikitext parser. Basic wikitext is not the easiest to parse to start with, and it is interleaved with HTML and whatnot, so parsing wikitext into a data structure is quite complex. MediaWiki itself doesn't seem to have such a parser, only code that converts wikitext to HTML for display.
Our case is a bit easier, as the practicalplants.org data for the `Plant` template is quite regular, with only a small amount of HTML and other irregularities. I have preliminary plans for powerplant to be able to use practicalplants.org for automatically populating a local MediaWiki instance, and then synchronizing the local MediaWiki with powerplant, allowing one to edit and browse the crop collection with MediaWiki. For this I'd like to have a special practicalplants wikitext parser in powerplant.
For this the way to go could be to:
- Extend the `PracticalplantsCrop` dataset in `db/practicalplants-data.js` to also contain the raw wikitext of the `Plant` template. This could be done in one PR.
- In a follow-up PR, update `db/practicalplants-data.js` with the parsed objects.

I'll get started on adding raw wikitext.
The extended structure could look like this: `[{ wikitext: String, object: PracticalplantsCrop }, ...]`, and then there could be two functions in `db/practicalplants-data.js`: `getCrops` for getting the parsed objects, and `getWikitextObjectPairs` for getting the whole data structure. With this structure it would be easy to compare wikitext and parsed objects in a diff.
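As a sketch, the module could then look something like this (function names as proposed above; the pair array would be generated, so its contents here are placeholders):

```js
// db/practicalplants-data.js (sketch): expose both the raw wikitext and
// the parsed objects so they can be compared in a diff.
const WIKITEXT_OBJECT_PAIRS = [
  // Generated entries, e.g.:
  // { wikitext: '{{Plant|common=Rosemary|...}}', object: { /* PracticalplantsCrop */ } },
];

function getWikitextObjectPairs() {
  return WIKITEXT_OBJECT_PAIRS;
}

function getCrops() {
  return WIKITEXT_OBJECT_PAIRS.map((pair) => pair.object);
}

module.exports = { getCrops, getWikitextObjectPairs };
```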
While there doesn't seem to be a JS parser available that generates a complete AST of the nested template structure, it might be useful to use the partial parses and fill in the details, to significantly ease the remaining parsing process.
Input:
```
{{Plant
|common=Rosemary
|family=Lamiaceae
|primary image=Rosmarinus officinalis.jpg
|forage={{Plant provides forage for|forage=Bees}}
|edible part and use={{Has part with edible use
|part used=Leaves
|part used for=Herbs
}}{{Has part with edible use
|part used=Leaves
|part used for=Dried
}}{{Has part with edible use
|part used=Flowers
|part used for=Salads
}}
|material part and use=
|medicinal part and use=
|sun=full sun
|shade=no shade
|hardiness zone=7
|heat zone=
|water=low
|drought=tolerant
|soil water retention=well drained, moist
|soil texture=sandy, loamy
|soil ph=acid, neutral, alkaline
|wind=No
|maritime=Yes
|native range=South Europe, West Asia
|ecosystem niche=Shrub
|life cycle=perennial
|herbaceous or woody=woody
|deciduous or evergreen=evergreen
|
|fertility=self fertile
|mature measurement unit=metres
|mature height=1.2
|mature width=1.2
|flower colour=blue
|grow from=seed, cutting
|seed requires stratification=No
|seed dormancy depth=
|seed requires scarification=No
|seed requires smokification=No
|cutting type=semi-ripe
|bulb type=
|graft rootstock=
|edible parts=flowers, leaves
|edible uses=Herb, Salad, Dry
}}
```
Output of `infobox-parser(input)`:
```
{ general:
{ common: 'Rosemary',
family: 'Lamiaceae',
primaryImage: 'Rosmarinus officinalis.jpg',
forage: 'Plant provides forage for',
ediblePartAndUse: 'Has part with edible use',
partUsed: 'Flowers',
partUsedFor: 'Salads',
sun: 'full sun',
shade: 'no shade',
hardinessZone: '7',
water: 'low',
drought: 'tolerant',
soilWaterRetention: 'well drained, moist',
soilTexture: 'sandy, loamy',
soilPh: 'acid, neutral, alkaline',
wind: 'No',
maritime: 'Yes',
nativeRange: 'South Europe, West Asia',
ecosystemNiche: 'Shrub',
lifeCycle: 'perennial',
herbaceousOrWoody: 'woody',
deciduousOrEvergreen: 'evergreen',
fertility: 'self fertile',
matureMeasurementUnit: 'metres',
matureHeight: '1.2',
matureWidth: '1.2',
flowerColour: 'blue',
growFrom: 'seed, cutting',
seedRequiresStratification: 'No',
seedRequiresScarification: 'No',
seedRequiresSmokification: 'No',
cuttingType: 'semi-ripe',
edibleParts: 'flowers, leaves',
edibleUses: 'Herb, Salad, Dry' },
tables: [],
lists: [] }
```
Output of `wtf_wikipedia(input).templates()`:
```
[ { forage: 'Bees', template: 'plant provides forage for' },
{ 'part used': 'Leaves',
'part used for': 'Herbs',
template: 'has part with edible use' },
{ 'part used': 'Leaves',
'part used for': 'Dried',
template: 'has part with edible use' },
{ 'part used': 'Flowers',
'part used for': 'Salads',
template: 'has part with edible use' },
{ common: 'Rosemary',
family: 'Lamiaceae',
'primary image': 'Rosmarinus officinalis.jpg',
sun: 'full sun',
shade: 'no shade',
'hardiness zone': '7',
water: 'low',
drought: 'tolerant',
'soil water retention': 'well drained, moist',
'soil texture': 'sandy, loamy',
'soil ph': 'acid, neutral, alkaline',
wind: 'No',
maritime: 'Yes',
'native range': 'South Europe, West Asia',
'ecosystem niche': 'Shrub',
'life cycle': 'perennial',
'herbaceous or woody': 'woody',
'deciduous or evergreen': 'evergreen',
list: [ '' ],
fertility: 'self fertile',
'mature measurement unit': 'metres',
'mature height': '1.2',
'mature width': '1.2',
'flower colour': 'blue',
'grow from': 'seed, cutting',
'seed requires stratification': 'No',
'seed requires scarification': 'No',
'seed requires smokification': 'No',
'cutting type': 'semi-ripe',
'edible parts': 'flowers, leaves',
'edible uses': 'Herb, Salad, Dry',
template: 'plant' } ]
```
Good idea to use partial parses. I will write the tests before completing the parser. This way I'll have clarity.
I could do the parser; it doesn't seem to take that many lines to write a recursive descent parser that produces an AST, with the limitation of accepting only inputs where the tokens `{{` and `}}` can be part of template expressions (and not within HTML constructs).
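For illustration, a minimal sketch of such a parser under that limitation (untested; the AST node shapes are just one possible choice, not the final powerplant parser):

```js
// Recursive descent parser sketch for nested {{Template|key=value|...}}
// expressions. It assumes '{{' and '}}' appear only as template delimiters
// (not inside HTML constructs), as discussed above.
function parse(input) {
  let pos = 0;
  const lookingAt = (token) => input.startsWith(token, pos);

  // Parse a mixed sequence of text and templates, stopping at any of the
  // given tokens at the current nesting level.
  function parseValue(stops) {
    const parts = [];
    let text = '';
    while (pos < input.length && !stops.some((stop) => lookingAt(stop))) {
      if (lookingAt('{{')) {
        if (text) { parts.push(text); text = ''; }
        parts.push(parseTemplate());
      } else {
        text += input[pos++];
      }
    }
    if (text) parts.push(text);
    return parts;
  }

  // Parse one '{{name|key=value|...}}' expression into an AST node.
  function parseTemplate() {
    pos += 2; // consume '{{'
    let name = '';
    while (pos < input.length && !lookingAt('|') && !lookingAt('}}')) {
      name += input[pos++];
    }
    const params = [];
    while (lookingAt('|')) {
      pos++; // consume '|'
      // Treat the text before '=' as the parameter key if the '=' comes
      // before the next delimiter; otherwise the parameter is positional.
      const eq = input.indexOf('=', pos);
      const next = Math.min(
        ...['|', '{{', '}}']
          .map((token) => input.indexOf(token, pos))
          .filter((index) => index >= 0)
      );
      let key = null;
      if (eq >= 0 && eq < next) {
        key = input.slice(pos, eq).trim();
        pos = eq + 1;
      }
      params.push({ key, value: parseValue(['|', '}}']) });
    }
    pos += 2; // consume '}}'
    return { type: 'template', name: name.trim(), params };
  }

  return parseValue([]);
}
```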
For now I'd probably accept a PR that extends `practicalplants.js` to include the raw wikitext (as explained in the previous comment). Your tool could be useful for fetching the raw wikitexts for this PR.
Sounds fair. I might even pick up the dragon book to learn more about compiler design :)
I found a parsing library called chevrotain. It lets you define a grammar and generates an AST for you. Mind if I try this out or do you still prefer to write your own parser?
I have removed the "good first issue" label from all issues since I feel that none of them are defined clearly enough to be done by a newcomer who by definition doesn't have an overall view of the project. As we are currently in the process of defining this project more clearly, the conditions are not easy for small contributions. Once the development gets more stable I'll perhaps also have a better view of tasks that would be good for newcomers.
You are welcome to continue exploring different ways to parse the practicalplants MediaWiki format (and powerplant in general), but please note that I also don't currently have a complete picture of how powerplant should look. If you continue giving me input in the form of comments and PRs, I do try to eventually evaluate them, but the point of view of the evaluation is what would be good for powerplant overall. So it is likely that even if something works it won't get merged if it is against the overall design, because otherwise it would get largely reverted in the next commit.
I understand. I think it makes sense for powerplant to be further developed then. Thank you for your extremely valuable feedback on my PRs.
Create a tool for scraping crop data from the practicalplants wiki. Try to divide it into functions that could also be useful for #80. The output format should be like `practicalplants.json`, or a symmetrical plain JS file (#70). Currently `practicalplants.json` is missing some data for the properties edible part and use, medicinal part and use, and material part and use; the tool should handle these correctly. The tool should be written in JS, but it can be based on the Python tool that was originally used for generating `practicalplants.json`.
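For reference, the comments above suggest these array properties should parse into arrays of objects rather than strings; an illustrative expected shape, using values from the Rosmarinus officinalis example shown earlier in this thread:

```js
// Illustrative expected shape for the array properties (not a full crop
// object); values taken from the Rosmarinus officinalis example above.
({
  "binomial": "Rosmarinus officinalis",
  "edible part and use": [
    { "part used": "Leaves", "part used for": "Herbs" },
    { "part used": "Leaves", "part used for": "Dried" },
    { "part used": "Flowers", "part used for": "Salads" }
  ],
  "forage": [
    { "forage": "Bees" }
  ]
})
```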