Some issue attachment links still do not work

GoogleCodeExporter commented 9 years ago

Sorry for being picky about this.

I have 249 attachments in my GoogleCodeProjectHosting.json.

Using the info from issue 39 (comment indices are 0-based) I managed to obtain 
links to 122 attachments (listed in good-links.txt).

Apparently another 112 of the attachments actually use 1-based indices in the 
links (listed in bad-fixable.txt).

Yet there are still 15 attachment links that *do not work at all* (listed in 
bad-unfixable.txt).

See the attached files for the links. I used the projects tint2 and pmn.

--

As a workaround I tried to directly use the links visible on the Google Code 
webpage, which look like this:

https://pmn.googlecode.com/issues/attachment?aid=10004000&name=magic&token=removedthesecrettoken

and download all of them to my computer. However I *have to* include the secret 
token parameter in the URL otherwise the script gets redirected to a Sign In 
page which it cannot handle (as it is not a browser). It is not clear how to 
obtain that token.

Finally, I tried copy-pasting the token from the browser to my script but 
apparently a separate token is required for each attachment, so my conclusion 
is that it is impossible to automate this process except maybe with a browser 
extension that crawls the whole issue tracker.

Original issue reported on code.google.com by mrovi9...@gmail.com on 18 Mar 2015 at 12:52

Attachments:

GoogleCodeExporter commented 9 years ago

And I am not even sure that I have really "fixed" some of the bad links, as I 
sometimes see colliding file names in different comments like this:

User comment 1: Something doesn't work, attaches config.txt
User comment 2: Workaround doesn't work either, attaches config.txt

Original comment by mrovi9...@gmail.com on 18 Mar 2015 at 12:57

GoogleCodeExporter commented 9 years ago

Thank you for the bug report. I will take a look.

Original comment by chrsm...@google.com on 18 Mar 2015 at 3:21

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Any update? We're looking to add an "Import from Google Code" feature to GitLab 
and want to import attachments as well, but this issue is standing in our way.

Thank you.

Original comment by do...@gitlab.com on 30 Mar 2015 at 9:16

GoogleCodeExporter commented 9 years ago

We have added a wiki page detailing how our issue attachment mirroring works. 
You can get attachments without the additional token if you go straight to 
Google Cloud Storage.

See: https://code.google.com/p/support-tools/wiki/IssueMirror

However our issue mirroring does have some known issues (which we won't fix). 
We only mirror _public_ issues, so attachments for private (Restrict-View-*) 
issues will not be available. Similarly, we don't mirror attachments for 
deleted issues. And, in the occasional situation where you upload two 
attachments in the same comment, both having the same name, we only have one 
of the files on GCS.

As for your bad-unfixable.txt file, you have uncovered something afoot in 
Google Code. For example:

Bad link: 
https://storage.googleapis.com/google-code-attachments/tint2/issue-471/comment-34/fedora-2015-03-15T10-43-31-220064000Z.webm 
for issue 471, comment 35, attachment {"mimetype": "application/octet-stream", 
"attachmentId": "4710035000", "fileSize": 10148504, "fileName": 
"fedora-2015-03-15T10-43-31-220064000Z.webm"}

The attachment file is actually found in our mirror, though as comment #31:
https://storage.googleapis.com/google-code-attachments/tint2/issue-471/comment-31/fedora-2015-03-15T10-43-31-220064000Z.webm

I have no explanation for the discrepancy, other than I probably wrote the 
errant code.
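
Judging only from the two links above, mirror URLs appear to follow the pattern 
`.../google-code-attachments/<project>/issue-<N>/comment-<M>/<file name>`. A 
minimal Python sketch under that assumption follows. `mirror_url` is a 
hypothetical helper, and percent-encoding the file name is also an assumption 
rather than documented mirror behavior:

```python
from urllib.parse import quote

# Base of the two example links quoted above.
MIRROR_BASE = "https://storage.googleapis.com/google-code-attachments"


def mirror_url(project, issue_id, comment_index, file_name):
    """Build a mirror URL in the shape of the example links (hypothetical helper)."""
    # Assumption: file names are percent-encoded, so spaces etc. stay valid in a URL.
    return "{}/{}/issue-{}/comment-{}/{}".format(
        MIRROR_BASE, project, issue_id, comment_index, quote(file_name)
    )


name = "fedora-2015-03-15T10-43-31-220064000Z.webm"
bad = mirror_url("tint2", 471, 34, name)   # the link that does not resolve
good = mirror_url("tint2", 471, 31, name)  # where the file actually lives
print(good)
```

The two calls differ only in the comment index, which is the whole discrepancy 
reported for this attachment.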

Could you tell me more about how your "Import from Google Code" feature of 
GitLab works? I take it GitLab supports arbitrary file attachments, so you need 
to download the attachments at the time you do the import? If so, that's a 
great feature. But note that there is a delay between when issue attachments 
are uploaded to Google Code and when they are mirrored onto Google Cloud 
Storage.

Original comment by chrsm...@google.com on 31 Mar 2015 at 12:11

GoogleCodeExporter commented 9 years ago

The known restrictions to the IssueMirror are not a problem.

As for migrating public attachments, we have two options, since GitLab does 
indeed support arbitrary file attachments:

1. Download all attachments from the IssueMirror and reupload to GitLab when a 
user requests a project import. 
2. Link directly to the attachment on the IssueMirror from the new GitLab issue.

Both options depend on the IssueMirror URLs actually working, which 
mrovi9...@gmail.com reports they aren't currently. The first option also 
requires that we aren't rate limited or otherwise blocked by Google Cloud 
Storage for downloading large numbers of files. The second option is only 
viable if there is a guarantee that the IssueMirror will stay up indefinitely. 

Downsides to the first option would be the storage and bandwidth requirements 
on our side, and the mirror delay you mention (how large is that delay?) The 
main downside to the second option would be the ongoing dependency on Google 
Storage :)

We are ultimately fine with either option, but we need to be sure the 
IssueMirror URLs work.

Original comment by do...@gitlab.com on 31 Mar 2015 at 12:56

GoogleCodeExporter commented 9 years ago

I think you should go with option two, if for no other reason than its 
simplicity. As for your concerns about how long the issue mirror will stay 
around, it simply is a Google Cloud Storage bucket. So the overhead is 
negligible. Read: if I get hit by a bus, the data will still be there. And 
unlike other parts of Google Code, issue attachments aren't as problematic of 
an abuse vector.

As for the delay in mirroring, at worst it will be a few days. Due to some 
technical limitations in how to bridge security from our internal data centers 
and the external Google Cloud Storage, I need to run the migration manually.

I will be looking at those `bad-unfixable.txt` attachments today, as it 
certainly is a bug somewhere. I'll update this issue when I've hunted it down...

Original comment by chrsm...@google.com on 1 Apr 2015 at 7:45

GoogleCodeExporter commented 9 years ago

All right, thank you Chris. I wasn't sure if this bucket was meant to be 
permanent or if the plan was to let it go after some amount of time had passed, 
but I guess there's really no point to since the amount of data is negligible 
in the grander scheme of things.

Good luck with `bad-unfixable.txt`, I hope you figure it out.

Original comment by do...@gitlab.com on 1 Apr 2015 at 9:43

GoogleCodeExporter commented 9 years ago

In case you were curious, the underlying issue has to do with deleted comments.

We were mirroring the attachment under comment #31, but displaying it on the 
site under comment #35. That's because before we render the web page we go 
through _all_ comments (including those that have been deleted[1]), whereas 
the issue mirror just goes through "live" comments.

This mismatch is why the attachment was put in the wrong place. There were four 
deleted comments before the attachment was put up. So while Google Code says 
you are looking at comment #35, in actuality it is only the 31st LIVE comment.

See the following comments and notice that, for each one, the comment right 
after it isn't shown. For example, there is a comment #10 and a #12, but no #11.

https://code.google.com/p/tint2/issues/detail?id=471#c10
https://code.google.com/p/tint2/issues/detail?id=471#c19
https://code.google.com/p/tint2/issues/detail?id=471#c27
https://code.google.com/p/tint2/issues/detail?id=471#c30
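
The arithmetic above can be sketched in Python as follows. `live_position` is a 
hypothetical helper, and the set of deleted comment numbers is only inferred 
from the gaps after the four linked comments, not taken from any official data:

```python
def live_position(displayed_number, deleted_numbers):
    """Map a comment number shown on the site to its position among live comments."""
    if displayed_number in deleted_numbers:
        raise ValueError("comment #%d is deleted" % displayed_number)
    # Every deleted comment before this one shifts its live position down by one.
    deleted_before = sum(1 for n in deleted_numbers if n < displayed_number)
    return displayed_number - deleted_before


# Assumption: the comment after each of #10, #19, #27 and #30 is the deleted one.
deleted = {11, 20, 28, 31}
print(live_position(35, deleted))  # 31
```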

Fixing it might be a pain, but at least we know what the problem is.

[1] Deleting data in large-scale replicated datastores is actually difficult. 
So many times things get deleted by simply clearing a "LIVE" field, and 
possibly zeroing out the data; but still leaving the placeholder object in 
place. This way you don't also have to move lots of data around since you now 
have an XX byte hole inside of a YY Gigabyte file.

Original comment by chrsm...@google.com on 1 Apr 2015 at 10:01

GoogleCodeExporter commented 9 years ago

So I'm assuming you're planning to actually fix the data in the bucket? It 
would be easiest to simply change the docs to say the ID is based on how many 
live comments come before it, but I guess that will mess up the numbering when 
a comment is deleted after an attachment is mirrored.

Original comment by do...@gitlab.com on 1 Apr 2015 at 10:13

GoogleCodeExporter commented 9 years ago

Thanks for the explanation. I can confirm that with the attached script I do 
not see any bad URLs in project tint2 :D

IIUC things may still go out of sync if someone deletes a comment after you run 
the mirroring script but before the takeout is generated; and they will become 
in sync again the next time you run the mirroring script?

Original comment by mrovi9...@gmail.com on 1 Apr 2015 at 10:49

Attachments:

test-attachments-gcs.py

GoogleCodeExporter commented 9 years ago

Looking at how we surface the data (HTML frontend, Google Takeout JSON dump, 
and the GitHub exporter) it seems like the thing we need to fix is the issue 
mirror's counting schema.

That, unfortunately, might require some major changes because of how that 
system works.

#Summary#

Google Code comments can be deleted. In the HTML frontend, we number comments 
including these deleted comments. Similarly, in the Google Code takeout we 
include deleted comments in the data dump.

The problem is that in the Issue Mirror we ignore deleted comments, so when we 
render a link to issue X, comment Y, "Y" refers to the Yth LIVE comment, not 
the Yth comment overall.

As a workaround, you will need to filter out non-LIVE comments from your 
Google Takeout dump, as @mrovi9000 did in their script.
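
A rough Python sketch of that workaround follows. It assumes the comments of one 
issue are available as a list of dicts in display order, that attachments carry 
a `fileName` key as in the JSON quoted earlier in this thread, and that deleted 
comments can be recognized by a `deletedBy` key. That last marker is a guess 
about the Takeout format, so `is_deleted` is kept as a replaceable hook. Indices 
start at 0, following the 0-based convention mentioned at the top of this issue:

```python
def is_deleted(comment):
    # Assumption: a deleted comment carries a "deletedBy" key in the dump.
    return "deletedBy" in comment


def attachments_by_live_index(comments, deleted=is_deleted):
    """Yield (live_index, file_name) for every attachment on a live comment."""
    # Drop deleted comments first, then number the remaining ones from 0.
    live = (c for c in comments if not deleted(c))
    for live_index, comment in enumerate(live):
        for attachment in comment.get("attachments", []):
            yield live_index, attachment["fileName"]


# Made-up sample data, only to show the renumbering.
comments = [
    {"id": 0, "attachments": [{"fileName": "config.txt"}]},
    {"id": 1, "deletedBy": {"name": "someone"}},
    {"id": 2, "attachments": [{"fileName": "log.txt"}]},
]
result = list(attachments_by_live_index(comments))
print(result)  # [(0, 'config.txt'), (1, 'log.txt')]
```

The comment displayed as #2 comes out with live index 1, which is the number 
the mirror would have used in its `comment-N` path segment.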

Similarly, douwe@gitlab.com, you will need to take note of the number of 
comments you see when scraping Google Code output; and not look at the comment 
numbers we display in the HTML.

Re: "things may still go out of sync if someone deletes a comment after you run 
the mirroring script but before the takeout is generated"

Correct. When the mirroring process runs, it will upload an attachment to 
something like .../tint2/issue-X/comment-10/... Now if a comment BEFORE the 
attachment comment gets deleted, AND you then run Google Takeout to export your 
project issues, THEN the comment numbers won't line up.

Re: "will they become in sync again the next time you run the mirroring script?"

No. Currently, the code that does the mirroring exists in a world that doesn't 
know anything about deleted comments. So as far as it knows, the comment 
numbers it generates are correct.

So it seems like getting the Issue Mirror to be aware of deleted comments will 
fix all known problems, so that you can just link to comment X (where X 
counts both LIVE and deleted comments).

I'll start looking into how we can fix this. Sorry for the inconvenience! Once 
I get the Issue Mirror attachments to include counts from deleted comments, you 
should just be able to generate links to Google Cloud Storage as you would 
expect. (Either using the comment numbers from our HTML, or directly from the 
Google Takeout JSON dump.)

Original comment by chrsm...@google.com on 2 Apr 2015 at 12:01

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Thanks for looking into this! We're planning to include Google Code import in 
GitLab 7.10, due to be released April 22nd with code freeze around the 14th. 
I'm hopeful this IssueMirror issue will be fixed before then.

As an aside, is there a reason why the attachment URLs aren't included in the 
Takeout dump directly? That way the specific way of counting wouldn't have 
mattered anyway, as long as the right JSON attachment object had the right URL.

Original comment by do...@gitlab.com on 2 Apr 2015 at 8:03

GoogleCodeExporter commented 9 years ago

As of right now all attachments on Google Code should exist on Google Cloud 
Storage with the right comment ID. (This impacted < 5% of all issues mirrored.) 
If you see any problems please let me know.

I'll update the IssueMirror FAQ to clarify this, but the issue mirror now 
exports issues to match the exact comment number that you see on the site. So 
the attachment at comment #X should be on Google Cloud Storage with comment-X.

If any files are attached to an issue when it is initially reported, that is 
considered "comment #0".

This also applies to Google Takeout JSON dumps. The "id" property of each 
comment should correspond to the Google Cloud Storage bucket number.
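
With that change, link generation should no longer need any filtering. A minimal 
Python sketch follows, assuming a simplified issue dict of the form 
`{"id": ..., "comments": [...]}` (the real Takeout nesting may differ) and the 
URL pattern seen in the mirror links earlier in this thread. `attachment_urls` 
is a hypothetical helper, and percent-encoding the file name is an assumption:

```python
from urllib.parse import quote

MIRROR_BASE = "https://storage.googleapis.com/google-code-attachments"


def attachment_urls(project, issue):
    """Yield one mirror URL per attachment, using each comment's "id" as-is."""
    for comment in issue["comments"]:
        # Files attached to the initial report sit on the comment with id 0.
        for attachment in comment.get("attachments", []):
            yield "{}/{}/issue-{}/comment-{}/{}".format(
                MIRROR_BASE,
                project,
                issue["id"],
                comment["id"],
                quote(attachment["fileName"]),
            )


# Made-up sample issue; only the "id" and "fileName" keys come from this thread.
issue = {
    "id": 471,
    "comments": [
        {"id": 0, "attachments": [{"fileName": "config.txt"}]},
        {"id": 35, "attachments": [{"fileName": "fedora-2015-03-15T10-43-31-220064000Z.webm"}]},
        {"id": 36},
    ],
}
urls = list(attachment_urls("tint2", issue))
for url in urls:
    print(url)
```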

Re: "is there a reason why the attachment URLs aren't included in the Takeout 
dump directly?"

The issue mirror didn't exist at the time we wrote the Google Takeout support. 
As for why not to add direct download links now, it's a known issue. I just 
haven't been able to get to it yet.

Re: "We're planning to include Google Code import in GitLab 7.10, due to be 
released April 22nd with code freeze around the 14th."

Sounds great. Please contact me (chrsmith@google.com) if there is anything I 
can do for you. If you send me a link or doc with how the system works, I'd be 
happy to mention it in the project's wiki.

Original comment by chrsm...@google.com on 2 Apr 2015 at 9:50

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Thanks Chris, I'll let you know if we need anything else.

Original comment by do...@gitlab.com on 3 Apr 2015 at 9:21
