Error parsing Apache's Jira

rodrigokuroda commented 10 years ago

I'm trying to mine some of Apache's JIRA projects and I got this error:

$ bicho --db-user-out operator --db-password-out operator --db-database-out hadoop -d 15 -b jira -u "https://issues.apache.org/jira/browse/HADOOP"

Checking URL: https://issues.apache.org
Running Bicho with delay of 15 seconds
Tickets to be retrieved: 8839
Error parsing URL: https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?pid=HADOOP&sorter/field=updated&sorter/order=INC&tempMax=500&pager/start=0
Traceback (most recent call last):
  File "/usr/local/bin/bicho", line 25, in <module>
    retval = bicho.main.main()
  File "/usr/local/lib/python2.7/dist-packages/bicho/main.py", line 56, in main
    backend.run()
  File "/usr/local/lib/python2.7/dist-packages/bicho/backends/jira.py", line 935, in run
    self.analyze_bug_list(issues_per_xml_query, bugs_number - remaining, bugsdb, dbtrk.id)
  File "/usr/local/lib/python2.7/dist-packages/bicho/backends/jira.py", line 882, in analyze_bug_list
    self.safe_xml_parse(url_issues, handler)
  File "/usr/local/lib/python2.7/dist-packages/bicho/backends/jira.py", line 869, in safe_xml_parse
    parser2.feed(cleaned_contents)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 214, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.7/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:39:2: mismatched tag

sduenas commented 10 years ago

I've reproduced the error. When trying to get the list of issues, the JIRA server returns a 403 Forbidden:

sduenas@Guybrush:~$ wget "https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?pid=HADOOP&sorter/field=updated&sorter/order=INC&tempMax=500&pager/start=0"
--2014-07-04 13:46:14--  https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?pid=HADOOP&sorter/field=updated&sorter/order=INC&tempMax=500&pager/start=0
Resolving issues.apache.org (issues.apache.org)... 140.211.11.121
Connecting to issues.apache.org (issues.apache.org)|140.211.11.121|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2014-07-04 13:46:15 ERROR 403: Forbidden.
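
The "mismatched tag" exception in the original report is presumably what the SAX parser says when it is handed the body of that 403 response instead of the issue feed. Below is a minimal sketch of checking the status before parsing, so the failure is reported as an HTTP problem. It is not Bicho's code: parse_feed, FeedError and ItemCounter are hypothetical names, the sketch targets Python 3 rather than the Python 2.7 shown in the traceback, and it assumes each issue in the feed is an <item> element.

```python
import xml.sax


class FeedError(Exception):
    """Raised when the server response is not a usable XML feed."""


def parse_feed(status, body, handler):
    # Hypothetical helper: refuse to parse anything that is not a 200 response,
    # so an error page never reaches the SAX parser.
    if status != 200:
        raise FeedError("server returned HTTP %d; not parsing the body" % status)
    xml.sax.parseString(body, handler)


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements (assumed here to be one per issue)."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1
```

With this guard, a 403 surfaces as a FeedError naming the status code rather than as a parse error pointing at a line of the response body.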

sduenas commented 10 years ago

The error is caused by the tempMax URL parameter, which sets the maximum number of issues returned in each response. Bicho requests 500, and Apache's JIRA rejects that. With a lower value (200 seems to be the upper limit in Apache's JIRA), the list of issues is returned without problems:

sduenas@Guybrush:~$ wget "https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?pid=HADOOP&sorter/field=updated&sorter/order=INC&tempMax=200&pager/start=0"
--2014-07-04 14:03:08--  https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?pid=HADOOP&sorter/field=updated&sorter/order=INC&tempMax=200&pager/start=0
Resolving issues.apache.org (issues.apache.org)... 140.211.11.121
Connecting to issues.apache.org (issues.apache.org)|140.211.11.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: 'SearchRequest.xml?pid=HADOOP&sorter%2Ffield=updated&sorter%2Forder=INC&tempMax=200&pager%2Fstart=0.2'

    [ <=> ] 1,527,102    377KB/s   in 4.0s

2014-07-04 14:03:14 (377 KB/s) - 'SearchRequest.xml?pid=HADOOP&sorter%2Ffield=updated&sorter%2Forder=INC&tempMax=200&pager%2Fstart=0.2' saved [1527102]

I'm going to add a new command-line parameter to set this limit.
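
To illustrate what a configurable limit means for paging, here is a rough sketch that builds one search URL per page from the query parameters visible in the URLs above. It is not the code that went into Bicho: page_urls and its arguments are hypothetical names, and the default page size of 200 is just the value that worked against Apache's JIRA in the test above.

```python
from urllib.parse import urlencode

# Search endpoint copied from the wget commands above.
BASE = ("https://issues.apache.org/jira/sr/"
        "jira.issueviews:searchrequest-xml/temp/SearchRequest.xml")


def page_urls(project, total, page_size=200):
    """Yield one search URL per page of at most page_size issues."""
    for start in range(0, total, page_size):
        query = urlencode(
            [
                ("pid", project),
                ("sorter/field", "updated"),
                ("sorter/order", "INC"),
                ("tempMax", page_size),   # issues per response
                ("pager/start", start),   # offset of the first issue
            ],
            safe="/",  # keep the slashes in the parameter names unescaped
        )
        yield BASE + "?" + query
```

For the 8839 tickets reported in the first run, a page size of 200 gives 45 requests, the last one starting at offset 8800.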

rodrigokuroda commented 10 years ago

Ok, @sduenas! Thanks for your feedback!

sduenas commented 10 years ago

Fixed by commit 779f707.

@rodrigokuroda, please update your local repo with the latest changes and run Bicho with the new parameter --num-issues (-n) set to 200.

For instance:

bicho --num-issues 200 -g --db-user-out xxxxxxx --db-password-out xxxxxxx --db-database-out bicho -d 15 -b jira -u "https://issues.apache.org/jira/browse/HADOOP"

It should work.

rodrigokuroda commented 10 years ago

@sduenas now it's working, thank you!

rodrigokuroda commented 10 years ago

But... I had another error.

$ bicho --db-user-out operator --db-password-out operator --db-database-out hadoop -n 200 -d 15 -b jira -u "https://issues.apache.org/jira/browse/HADOOP"

Tickets to be retrieved: 8844
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/bicho/backends/jira.py", line 886, in analyze_bug_list
    issues = handler.getIssues()
  File "/usr/local/lib/python2.7/dist-packages/bicho/backends/jira.py", line 750, in getIssues
    bicho_bugs.append(self.getIssue(bug))
  File "/usr/local/lib/python2.7/dist-packages/bicho/backends/jira.py", line 793, in getIssue
    changes = parser.parse_changes()
  File "/usr/local/lib/python2.7/dist-packages/bicho/backends/jira.py", line 393, in parse_changes
    author_url = span_link['rel']
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 613, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'rel'

sduenas commented 10 years ago

This new error is fixed by commit c4bbc37.

@rodrigokuroda please check it again. I've run the tool again, retrieving info from almost 600 issues, and it seems to work.
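
For context on the KeyError: the traceback shows span_link['rel'] failing because the tag has no rel attribute. The sketch below shows the general shape of a defensive lookup. It does not reproduce commit c4bbc37, and it uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra packages. AuthorLinkCollector and author_urls are hypothetical names, and treating the link as an <a> tag is an assumption.

```python
from html.parser import HTMLParser


class AuthorLinkCollector(HTMLParser):
    """Collects the rel attribute of every <a> tag, or None when it is absent."""

    def __init__(self):
        super().__init__()
        self.author_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # dict.get returns None for a missing key instead of raising KeyError.
            self.author_urls.append(dict(attrs).get("rel"))


def author_urls(html):
    """Return the rel value (or None) for each link found in html."""
    collector = AuthorLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.author_urls
```

BeautifulSoup tags offer a similar get() method, so the same pattern should apply there: look the attribute up with a default and handle the missing case explicitly.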

rodrigokuroda commented 9 years ago

Now it's working. Thanks @sduenas!

MetricsGrimoire / Bicho

Error parsing Apache's Jira #130