fhamborg / Giveme5W1H

Extraction of the journalistic five W and one H questions (5W1H) from news articles: who did what, when, where, why, and how?
Apache License 2.0

TypeError: unorderable types: int() < str() #38

Closed lastshogun closed 4 years ago

lastshogun commented 4 years ago

Describe the bug

Hi, an error occurs when I apply the 5W1H extractor to my JSON news dataset.

The error occurs in _evaluate_locations when it runs "raw_locations.sort(key=lambda x: x[1], reverse=True)"; the console then reports "TypeError: unorderable types: int() < str()".

My question is: does this mean something is wrong with my dataset format? If so, shouldn't the extractor treat each news article as one long string when it processes the corpus? I'm eagerly looking for a solution to this problem.

Log

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractor.py", line 20, in run
    extractor.process(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/abs_extractor.py", line 41, in process
    self._evaluate_candidates(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/environment_extractor.py", line 75, in _evaluate_candidates
    locations = self._evaluate_locations(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/environment_extractor.py", line 224, in _evaluate_locations
    raw_locations.sort(key=lambda x: x[1], reverse=True)
TypeError: unorderable types: int() < str()

Here is one of the JSON news files that caused the error when I tried to analyse it with Giveme5W1H.

{ "title": "Martian rock named for Rolling StonesRolling Stones get name on little Martian rock that rolled", "body": "PASADENA, Calif. - There is now a Rolling Stones Rock on Mars, and it's giving Mick, Keith and the boys some serious satisfaction.NASA named a little stone for the legendary rockers after its InSight robotic lander captured it rolling across the surface of Mars last year, and the new moniker was made public at Thursday night's Rolling Stones' concert at the Rose Bowl.NASA has given us something we have always dreamed of, our very own rock on Mars. I can't believe it, Mick Jagger told the crowd after grooving through a rendition of Tumbling Dice. I want to bring it back and put it on our mantelpiece.Robert Downey Jr. announced the name, taking the stage just before the band's set at the Southern California stadium that is just a stone's throw from NASA's Jet Propulsion Laboratory, which manages InSight.Cross-pollinating science and a legendary rock band is always a good thing, the Iron Man actor said backstage.He told the crowd that JPL scientists had come up with the name in a fit of fandom and clever association.Charlie, Ronnie, Keith and Mick - they were in no way opposed to the notion, Downey said, but in typical egalitarian fashion, they suggested I assist in procuring 60,000 votes to make it official, so that's my mission.He led the audience in a shout of aye before declaring the deed done.Jagger later said, I want to say a special thanks to our favorite action man Robert Downey Jr. That was a very nice intro he gave.The rock, just a little bigger than a golf ball, was moved by InSight's own thrusters as the robotic lander touched down on Mars on Nov. 26.It only moved about 3 feet, but that's the farthest NASA has seen a rock roll while landing a craft on another planet.I've seen a lot of Mars rocks over my career, Matt Golombek, a JPL geologist who has helped NASA land all its Mars missions since 1997, said in a statement. 
This one probably won't be in a lot of scientific papers, but it's definitely one of the coolest.The Rolling Stones and NASA logos were shown side by side in the run-up to the show as the sun set over the Rose Bowl, leaving many fans perplexed as to what the connection was before it was announced.The concert had originally been scheduled for spring, before the Stones postponed their No Filter North American tour because Jagger had heart surgery.", "published_at": "2019-08-24", }

Versions (please complete the following information):

fhamborg commented 4 years ago

could you please clone the repo and debug to find what the content (and types) of raw_locations is? that would be very helpful, thanks.

lastshogun commented 4 years ago

These are all the elements of raw_locations for one news article that causes the error. Every element seems to be a multi-dimensional data type, so I'm not sure how they can be compared with each other.

[[{'before': ' ', 'originalText': 'Milan', 'characterOffsetEnd': 331, 'pos': 'NNP', 'word': 'Milan', 'lemma': 'Milan', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 32, 'characterOffsetBegin': 326}], '198722082', Point(45.4667971, 9.1904984, 0.0), [45.3867381, 45.5358482, 9.0408867, 9.2781103], 307144500, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac64c88>, 0] [[{'before': ' ', 'originalText': 'England', 'characterOffsetEnd': 426, 'pos': 'NNP', 'word': 'England', 'lemma': 'England', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 1, 'characterOffsetBegin': 419}], '198282901', Point(52.7954791, -0.540240286617432, 0.0), [49.674, 55.917, -6.7047494, 2.0919117], 439166113890, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac03fd0>, 0] [[{'before': ' ', 'originalText': 'United', 'characterOffsetEnd': 516, 'pos': 'NNP', 'word': 'United', 'lemma': 'United', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 19, 'characterOffsetBegin': 510}, {'before': ' ', 'originalText': 'States', 'characterOffsetEnd': 523, 'pos': 'NNPS', 'word': 'States', 'lemma': 'States', 'speaker': 'PER0', 'after': '', 'ner': 'LOCATION', 'index': 20, 'characterOffsetBegin': 517}], '197597979', Point(39.7837304, -100.4458825, 0.0), [-14.7608358, 71.6048217, -180.0, 180.0], 0, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac03f98>, 0] [[{'before': ' ', 'originalText': 'Morgan.Manchester', 'characterOffsetEnd': 587, 'pos': 'NNP', 'word': 'Morgan.Manchester', 'lemma': 'Morgan.Manchester', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 29, 'characterOffsetBegin': 570}, {'before': ' ', 'originalText': 'City', 'characterOffsetEnd': 592, 'pos': 'NNP', 'word': 'City', 'lemma': 'City', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 30, 'characterOffsetBegin': 588}], 70020179, Point(42.9950113, -71.4885476, 0.0), [42.993092, 42.9973798, -71.488662, -71.488421], 9044, 0, 0, 
<Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac030f0>, 0] [[{'before': ' ', 'originalText': 'Tottenham', 'characterOffsetEnd': 687, 'pos': 'NNP', 'word': 'Tottenham', 'lemma': 'Tottenham', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 45, 'characterOffsetBegin': 678}], '145844', Point(51.5976955, -0.0672892, 0.0), [51.5776955, 51.6176955, -0.0872892, -0.0472892], 12291508, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13f61227b8>, 0] [[{'before': ' ', 'originalText': 'England', 'characterOffsetEnd': 739, 'pos': 'NNP', 'word': 'England', 'lemma': 'England', 'speaker': 'PER0', 'after': '', 'ner': 'LOCATION', 'index': 6, 'characterOffsetBegin': 732}], '198282901', Point(52.7954791, -0.540240286617432, 0.0), [49.674, 55.917, -6.7047494, 2.0919117], 439166113890, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e2b0>, 0] [[{'before': ' ', 'originalText': 'USA', 'characterOffsetEnd': 821, 'pos': 'NNP', 'word': 'USA', 'lemma': 'USA', 'speaker': 'PER0', 'after': '', 'ner': 'LOCATION', 'index': 25, 'characterOffsetBegin': 818}], '197597979', Point(39.7837304, -100.4458825, 0.0), [-14.7608358, 71.6048217, -180.0, 180.0], 0, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e128>, 0] [[{'before': ' ', 'originalText': 'Netherlands', 'characterOffsetEnd': 899, 'pos': 'NNP', 'word': 'Netherlands', 'lemma': 'Netherlands', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 37, 'characterOffsetBegin': 888}], '198097880', Point(52.2379891, 5.53460738161551, 0.0), [11.825, 53.7253321, -68.6255319, 7.2274985], 38321668956550, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e0b8>, 0] [[{'before': '', 'originalText': 'Orlando', 'characterOffsetEnd': 1133, 'pos': 'NNP', 'word': 'Orlando', 'lemma': 'Orlando', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 79, 'characterOffsetBegin': 1126}], '198153254', Point(28.5421097, -81.3790388, 0.0), [28.3480634, 28.614283, 
-81.5075377, -81.2275862], 810976392, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e0f0>, 0] [[{'before': '', 'originalText': 'Manchester', 'characterOffsetEnd': 1211, 'pos': 'NNP', 'word': 'Manchester', 'lemma': 'Manchester', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 96, 'characterOffsetBegin': 1201}, {'before': ' ', 'originalText': 'City', 'characterOffsetEnd': 1216, 'pos': 'NNP', 'word': 'City', 'lemma': 'City', 'speaker': 'PER0', 'after': '', 'ner': 'LOCATION', 'index': 97, 'characterOffsetBegin': 1212}], '197628487', Point(53.4791301, -2.2441009, 0.0), [53.3401207, 53.5446042, -2.3198967, -2.1468278], 261248130, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e278>, 0] [[{'before': '', 'originalText': 'USA', 'characterOffsetEnd': 1305, 'pos': 'NNP', 'word': 'USA', 'lemma': 'USA', 'speaker': 'PER0', 'after': '', 'ner': 'LOCATION', 'index': 118, 'characterOffsetBegin': 1302}], '197597979', Point(39.7837304, -100.4458825, 0.0), [-14.7608358, 71.6048217, -180.0, 180.0], 0, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e160>, 0] [[{'before': '', 'originalText': 'England', 'characterOffsetEnd': 1329, 'pos': 'NNP', 'word': 'England', 'lemma': 'England', 'speaker': 'PER0', 'after': '', 'ner': 'LOCATION', 'index': 124, 'characterOffsetBegin': 1322}], '198282901', Point(52.7954791, -0.540240286617432, 0.0), [49.674, 55.917, -6.7047494, 2.0919117], 439166113890, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e198>, 0] [[{'before': '', 'originalText': 'Netherlands', 'characterOffsetEnd': 1359, 'pos': 'NNP', 'word': 'Netherlands', 'lemma': 'Netherlands', 'speaker': 'PER0', 'after': '', 'ner': 'LOCATION', 'index': 130, 'characterOffsetBegin': 1348}], '198097880', Point(52.2379891, 5.53460738161551, 0.0), [11.825, 53.7253321, -68.6255319, 7.2274985], 38321668956550, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e1d0>, 0] [[{'before': '', 
'originalText': 'Paris', 'characterOffsetEnd': 1403, 'pos': 'NNP', 'word': 'Paris', 'lemma': 'Paris', 'speaker': 'PER0', 'after': ' ', 'ner': 'LOCATION', 'index': 139, 'characterOffsetBegin': 1398}], '198006226', Point(48.8566101, 2.3514992, 0.0), [48.8155755, 48.902156, 2.224122, 2.4697602], 173141595, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f13cac1e208>, 0]

fhamborg commented 4 years ago

I cannot reproduce the issue at hand using the article you posted above. This is how I run the code:

import logging

from Giveme5W1H.extractor.document import Document
from Giveme5W1H.extractor.extractor import MasterExtractor

"""
This is a simple example of how to use the extractor in combination with a dict in news-please format.

- Nothing is cached

"""

# don't forget to start up core_nlp_host
# giveme5w1h-corenlp

title = "Martian rock named for Rolling StonesRolling Stones get name on little Martian rock that rolled. "
lead = ""
text = """PASADENA, Calif. - There is now a Rolling Stones Rock on Mars, and it's giving Mick, Keith and the boys some serious satisfaction.NASA named a little stone for the legendary rockers after its InSight robotic lander captured it rolling across the surface of Mars last year, and the new moniker was made public at Thursday night's Rolling Stones' concert at the Rose Bowl.NASA has given us something we have always dreamed of, our very own rock on Mars. I can't believe it, Mick Jagger told the crowd after grooving through a rendition of Tumbling Dice. I want to bring it back and put it on our mantelpiece.Robert Downey Jr. announced the name, taking the stage just before the band's set at the Southern California stadium that is just a stone's throw from NASA's Jet Propulsion Laboratory, which manages InSight.Cross-pollinating science and a legendary rock band is always a good thing, the Iron Man actor said backstage.He told the crowd that JPL scientists had come up with the name in a fit of fandom and clever association.Charlie, Ronnie, Keith and Mick - they were in no way opposed to the notion, Downey said, but in typical egalitarian fashion, they suggested I assist in procuring 60,000 votes to make it official, so that's my mission.He led the audience in a shout of aye before declaring the deed done.Jagger later said, I want to say a special thanks to our favorite action man Robert Downey Jr. That was a very nice intro he gave.The rock, just a little bigger than a golf ball, was moved by InSight's own thrusters as the robotic lander touched down on Mars on Nov. 26.It only moved about 3 feet, but that's the farthest NASA has seen a rock roll while landing a craft on another planet.I've seen a lot of Mars rocks over my career, Matt Golombek, a JPL geologist who has helped NASA land all its Mars missions since 1997, said in a statement. 
This one probably won't be in a lot of scientific papers, but it's definitely one of the coolest.The Rolling Stones and NASA logos were shown side by side in the run-up to the show as the sun set over the Rose Bowl, leaving many fans perplexed as to what the connection was before it was announced.The concert had originally been scheduled for spring, before the Stones postponed their No Filter North American tour because Jagger had heart surgery.
"""
date_publish = '2019-08-24'

if __name__ == '__main__':
    # logger setup
    log = logging.getLogger('GiveMe5W')
    log.setLevel(logging.DEBUG)
    sh = logging.StreamHandler()
    sh.setLevel(logging.DEBUG)
    log.addHandler(sh)

    # giveme5w setup - with defaults
    extractor = MasterExtractor()
    doc = Document.from_text(title + lead + text, date_publish)

    doc = extractor.parse(doc)

    top_who_answer = doc.get_top_answer('who').get_parts_as_text()
    top_what_answer = doc.get_top_answer('what').get_parts_as_text()
    top_when_answer = doc.get_top_answer('when').get_parts_as_text()
    top_where_answer = doc.get_top_answer('where').get_parts_as_text()
    top_why_answer = doc.get_top_answer('why').get_parts_as_text()
    top_how_answer = doc.get_top_answer('how').get_parts_as_text()

    print(top_who_answer)
    print(top_what_answer)
    print(top_when_answer)
    print(top_where_answer)
    print(top_why_answer)
    print(top_how_answer)

lastshogun commented 4 years ago

Thanks for replying. I just tested the code you provided above, but it still gives the same error, shown below. If you have any idea or assumption that could fix this problem, I'd much appreciate it. My research requires this library to extract time and location information, so it is stuck at this step right now. Here is the log from running your test code on my device.

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractor.py", line 20, in run
    extractor.process(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/abs_extractor.py", line 41, in process
    self._evaluate_candidates(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/environment_extractor.py", line 75, in _evaluate_candidates
    locations = self._evaluate_locations(document)
  File "/usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/environment_extractor.py", line 228, in _evaluate_locations
    raw_locations.sort(key=lambda x: x[1], reverse=True)
TypeError: unorderable types: str() < int()

fhamborg commented 4 years ago

I think this is a pretty straightforward bug, which you can easily fix yourself (unfortunately, I cannot, since I cannot reproduce it). When you have a fix, I'd kindly ask you to open a pull request, so that I can merge it into this repo and other users benefit from it. What you need to do is start the program in an IDE in debug mode and see what is in the respective entry of raw_locations that makes the process break (it seems to be a str, but it should be an int).

Alternatively, if you don't want to debug this or do not have an IDE installed: you can also insert a new line just before line 228 in /usr/local/lib/python3.5/dist-packages/Giveme5W1H/extractor/extractors/environment_extractor.py, which prints the contents of the respective "column" in raw_locations, e.g., something like: print([x[1] for x in raw_locations])
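A minimal sketch of that check, extended to print the type next to each value. The real raw_locations is built inside the extractor, so the rows below are a made-up stand-in and the helper name describe_column is hypothetical; only position 1 of each row matters here.

```python
def describe_column(rows, index=1):
    """Return (value, type name) pairs for one position of every row."""
    return [(row[index], type(row[index]).__name__) for row in rows]


# Made-up stand-in for raw_locations; position 1 plays the role of the location id.
sample_rows = [
    [["token"], 101, "other fields"],
    [["token"], "202", "other fields"],
]

for value, type_name in describe_column(sample_rows):
    print(repr(value), type_name)
```

Printing with repr() keeps the quotes around string values, so an id that is a str stands out from one that is an int.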

lastshogun commented 4 years ago

I'd like to fix the bug. Sorry if this is a bother, but in my second comment I printed and posted the elements of the list "raw_locations" using the method you just suggested. As you can see, each element of the list contains multiple fields, and since an element is neither a str nor an int, I don't understand how it can be compared (or sorted). Besides, since many news articles in my dataset trigger this bug when I apply Giveme5W1H, I think some mechanism in the code may be causing the problem. (This means that even if I find out what's wrong with this piece of news, the bug may keep happening when the rest of the news corpus is processed.) If I have misunderstood anything, please let me know. PS: because the format of my second comment is a bit messy, I post it again here:

`# sort locations based on id
    for x in range(len(raw_locations)):
        print(raw_locations[x])

    raw_locations.sort(key=lambda x: x[1], reverse=True)`

[[{'index': 1, 'lemma': 'PASADENA', 'speaker': 'PER0', 'word': 'PASADENA', 'originalText': 'PASADENA', 'characterOffsetEnd': 109, 'pos': 'NNP', 'after': '', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 101}, {'index': 2, 'lemma': ',', 'speaker': 'PER0', 'word': ',', 'originalText': ',', 'characterOffsetEnd': 110, 'pos': ',', 'after': ' ', 'before': '', 'ner': 'O', 'characterOffsetBegin': 109}, {'index': 3, 'lemma': 'Calif.', 'speaker': 'PER0', 'word': 'Calif.', 'originalText': 'Calif.', 'characterOffsetEnd': 117, 'pos': 'NNP', 'after': ' ', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 111}], 199274586, Point(34.1476452, -118.1444779, 0.0), [34.1172023, 34.251905, -118.198139, -118.065479], 182911336, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f1199bcec50>, 0] [[{'index': 13, 'lemma': 'Mars', 'speaker': 'PER0', 'word': 'Mars', 'originalText': 'Mars', 'characterOffsetEnd': 162, 'pos': 'NNP', 'after': '', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 158}], '198791424', Point(44.0009649, 3.5560447, 0.0), [43.9912494, 44.0146261, 3.535728, 3.5664698], 6390941, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f1199bcec88>, 0] [[{'index': 48, 'lemma': 'Mars', 'speaker': 'PER0', 'word': 'Mars', 'originalText': 'Mars', 'characterOffsetEnd': 362, 'pos': 'NNP', 'after': ' ', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 358}], '198791424', Point(44.0009649, 3.5560447, 0.0), [43.9912494, 44.0146261, 3.535728, 3.5664698], 6390941, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f1199bcecc0>, 0] [[{'index': 86, 'lemma': 'Mars', 'speaker': 'PER0', 'word': 'Mars', 'originalText': 'Mars', 'characterOffsetEnd': 551, 'pos': 'NNP', 'after': '', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 547}], '198791424', Point(44.0009649, 3.5560447, 0.0), [43.9912494, 44.0146261, 3.535728, 3.5664698], 6390941, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f1199b8c9e8>, 0] 
[[{'index': 30, 'lemma': 'Southern', 'speaker': 'PER0', 'word': 'Southern', 'originalText': 'Southern', 'characterOffsetEnd': 804, 'pos': 'NNP', 'after': ' ', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 796}, {'index': 31, 'lemma': 'California', 'speaker': 'PER0', 'word': 'California', 'originalText': 'California', 'characterOffsetEnd': 815, 'pos': 'NNP', 'after': ' ', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 805}], '239967342', Point(34.0224149, -118.286344073446, 0.0), [34.018385, 34.0254826, -118.2914551, -118.2801179], 823716, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f1199b380f0>, 0] [[{'index': 34, 'lemma': 'Mars', 'speaker': 'PER0', 'word': 'Mars', 'originalText': 'Mars', 'characterOffsetEnd': 1675, 'pos': 'NNP', 'after': ' ', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 1671}], '198791424', Point(44.0009649, 3.5560447, 0.0), [43.9912494, 44.0146261, 3.535728, 3.5664698], 6390941, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f1199b38f98>, 0] [[{'index': 44, 'lemma': 'Rose', 'speaker': 'PER0', 'word': 'Rose', 'originalText': 'Rose', 'characterOffsetEnd': 2171, 'pos': 'NNP', 'after': ' ', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 2167}, {'index': 45, 'lemma': 'Bowl', 'speaker': 'PER0', 'word': 'Bowl', 'originalText': 'Bowl', 'characterOffsetEnd': 2176, 'pos': 'NNP', 'after': '', 'before': ' ', 'ner': 'LOCATION', 'characterOffsetBegin': 2172}], 82354690, Point(50.9240979, -1.32204360262812, 0.0), [50.9233946, 50.9248456, -1.3231991, -1.3208924], 25921, 0, 0, <Giveme5W1H.extractor.candidate.Candidate object at 0x7f1199b38fd0>, 0]

fhamborg commented 4 years ago

As said in my most recent post, please post the result of print([x[1] for x in raw_locations]), which will only show the relevant values. Thanks

lastshogun commented 4 years ago

Thanks. Here is the result: [199274586, '198791424', '198791424', '198791424', '239967342', '198791424', 82354690]
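The list above mixes int and str ids, which is enough to explain the TypeError: Python 3 refuses to order an int against a str. The sketch below reproduces the failure with those ids and shows one possible workaround, casting inside the sort key. The rows are a simplified stand-in for raw_locations, and the workaround is an illustration only, not necessarily the change made in the library.

```python
ids = [199274586, '198791424', '198791424', '198791424', '239967342', '198791424', 82354690]

# Simplified stand-in rows: position 1 holds the id, as in raw_locations.
rows = [["tokens", location_id] for location_id in ids]

try:
    # Same call as in _evaluate_locations; fails because int and str cannot be ordered.
    rows.sort(key=lambda x: x[1], reverse=True)
    sort_failed = False
except TypeError as error:
    sort_failed = True
    print("sort failed:", error)

# Possible workaround: compare every id as an integer.
rows.sort(key=lambda x: int(x[1]), reverse=True)
print([row[1] for row in rows])
```

On Python 3.6 and later the message reads "'<' not supported between instances of ..." instead of "unorderable types", but the cause is the same. Casting in the key leaves the stored values untouched, so the rest of each row (Point, bounding box, Candidate object) is never converted.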

lastshogun commented 4 years ago

I tried to convert the string to int by myself, but the input contains functions and object addresses - it is therefore hard to reproduce what I want to do. Could you please share any advice?

fhamborg commented 4 years ago

This should be fixed in giveme5w1h 1.0.17, which you can find on PyPI. The issue was caused by Nominatim, which seems to return the location IDs as strings in a few cases, whereas they should always be ints.
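The thread does not show the actual patch. As a rough illustration of the idea (coerce each id to int as soon as it arrives, so that later sorting sees one type), a sketch could look like this. The function names normalize_location_id and normalize_rows are hypothetical, and the rows are a stand-in in which object() replaces the Candidate instances.

```python
def normalize_location_id(raw_id):
    """Coerce a location id to int, whether it arrives as an int or a digit string."""
    return int(raw_id)


def normalize_rows(rows, id_index=1):
    """Rewrite only the id position of each row in place; other fields stay as they are."""
    for row in rows:
        row[id_index] = normalize_location_id(row[id_index])
    return rows


# Stand-in rows modelled on the printed raw_locations entries.
rows = [
    [["PASADENA"], 199274586, object()],
    [["Mars"], "198791424", object()],
    [["Rose", "Bowl"], 82354690, object()],
]

normalize_rows(rows)
rows.sort(key=lambda x: x[1], reverse=True)
print([row[1] for row in rows])
```

Only position 1 of each row is converted, so nested dicts, Point values, and object references in the same row do not get in the way.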