apache / lucene-jira-archive

Jira archive for Apache Lucene
https://lucene.apache.org/
2 stars 6 forks source link

Inlined patches don't migrate correctly #106

Open mikemccand opened 2 years ago

mikemccand commented 2 years ago

While testing my #105 PR for #93, I came across this poorly migrated issue.

It harks back from the CVS days!!

It contains an inline patch, which is not rendered right, and my change in #105 perhaps makes it worse. I'll try to identify inlined patch files and put them in a code block.

mocobeta commented 2 years ago

Ahhhh Perhaps we could apply special treatments (e.g. escape all Markdown syntax) for old issues in the CVS era? According to this Wikipedia article, Markdown was released in 2004, and people are unlikely to use inline patches with Markdown. https://en.wikipedia.org/wiki/Markdown

mocobeta commented 2 years ago

Is it possible to estimate how many issues with inlined patches we have? I guess there are not that many? We can manually fix critical glitches afterward if we know what issues are broken.

mikemccand commented 2 years ago

I ran a simple approximate check, counting how often \n----\n occurs (which may undercount?), and it was small, 19. I think we should do nothing for this?

mocobeta commented 2 years ago

I can fix 19 issues by hand after the migration (there would be no hurry), could you tell me the issue numbers or commands to identify the issues with inline patches?

mikemccand commented 2 years ago

I can fix 19 issues by hand after the migration (there would be no hurry), could you tell me the issue numbers or commands to identify the issues with inline patches?

OK will do -- I'll post my findings here. Thanks @mocobeta.

mikemccand commented 2 years ago

OK I iterated some more on my silly "attempt to detected diff/patch/quoted code" to this VERY scratchy tool:

import os
import re
import glob
import json

reDiffLines = re.compile('^[\d,]+[cd][\d,]+\s*$', re.MULTILINE)
reDiffCommand = re.compile('\s+diff\s+[^\s]+')
reLeadingLessGreaterThan = re.compile(r'^[<>] .*?\);.*?', re.MULTILINE)

results = []

for file_name in glob.glob('jira-dump/*.json'):
    s = open(file_name).read()
    d = json.loads(s)
    jira_id = d['key']
    if not jira_id.startswith('LUCENE'):
        print(f'SKIPPING: {jira_id}')
        continue
    fields = d['fields']
    #print(fields.keys())                                                                                                                                                                                                            
    desc = fields['description']
    #if desc is not None and ('\n---\n' in desc or reDiffLines.search(desc)):                                                                                                                                                        
    if desc is not None:
        count = len(reLeadingLessGreaterThan.findall(desc))
        if reDiffLines.search(desc) or count >= 2:
            #print(f'MATCH: {d["key"]}\n  {desc}')                                                                                                                                                                                   
            #print(f'MATCH: {d["key"]}\n')                                                                                                                                                                                           
            results.append((jira_id, f'*** Diff in desc?: [{jira_id}](https://issues.apache.org/jira/browse/{jira_id})'))
            #print(repr(desc))                                                                                                                                                                                                       
            if jira_id == 'LUCENE-825':
                print(reDiffCommand.match(desc))
                print(repr(desc))
    for comment in fields['comment']['comments']:
        comment_text = comment['body']
        comment_id = comment['id']
        count = len(reLeadingLessGreaterThan.findall(comment_text))
        #if '\n---\n' in comment_text or reDiffLines.search(comment_text):                                                                                                                                                           
        if reDiffLines.search(comment_text) or count > 2:
            #print(f'MATCH: {d["key"]}\n  {comment_text}')                                                                                                                                                                           
            jira_comment_link = f'https://issues.apache.org/jira/browse/{jira_id}?focusedCommentId={comment_id}&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-{comment_id}'
            results.append((jira_id, f'*** Diff in comment?: [{jira_id}]({jira_comment_link})'))
            if jira_id == 'LUCENE-5400':
                print(reDiffCommand.search(comment_text))
                print(repr(comment_text))
            #print(repr(comment_text))                                                                                                                                                                                               

print(f'\n{len(results)} possible diffs:\n')
for jira_id, text in sorted(results, key=lambda x: int(x[0][7:])):
    print(text)

It produces these results:

9 possible diffs:

Diff in desc?: LUCENE-108 Diff in desc?: LUCENE-162 Diff in desc?: LUCENE-327 Diff in comment?: LUCENE-584 Diff in comment?: LUCENE-743 Diff in desc?: LUCENE-825 Diff in desc?: LUCENE-5110 Diff in comment?: LUCENE-5400 *** Diff in comment?: LUCENE-5934

mikemccand commented 2 years ago

Some of those are properly marked up with Jira's markup, and should migrate well, but others are not.

Honestly, I still feel this is part of the "long tail" of Jira authors writing comments in invalid Jira markup, and we should not aim to correct their mistakes here. Maybe a best effort when we see the errors at most ...

mocobeta commented 2 years ago

Thanks, I added this to the migration plan (or to-do list) in #7.

mocobeta commented 2 years ago

https://github.com/mocobeta/forks-migration-test-2/issues/634#issuecomment-1204765958 also includes an inlined patch.

mocobeta commented 2 years ago

This is actually not a complete "inlined patch," but the root cause is the same here. https://github.com/mocobeta/forks-migration-test-2/issues/631#issuecomment-1204765781

mocobeta commented 2 years ago

I'd leave this as "won't fix" - these issues will be manually corrected afterward with the best effort.

mikemccand commented 2 years ago

+1

mocobeta commented 2 years ago

@mikemccand sorry, I didn't mean we should close this - I think it'd be better to keep open (forever) to show what cannot be done in this migration.

mikemccand commented 2 years ago

Ahh OK +1 then. We should note that in the known limitations of migration. I'll add it to that doc: https://docs.google.com/document/d/10m6--f7vbU9OC_SfANN6vbNDvu25COMm0Jc4wnggNV0/edit#heading=h.njfsizyml9bt