Open Woseseltops opened 1 month ago
@rem0g , would this make your life better?
@susanodd , can you think of other drawbacks?
I agree the check for h264 is excessive. It was more a check that the user didn't put say, PDF files in with videos. But if we stick with "try-except" with friendly error reporting for when there was a problem, then this could be programmed away, so not to "check before" but just fail.
The zip uploads them to a different location because the normal upload of a video creates an object at the same time.
My opinion is that we should turn off reversion. (It will create a mess if it stays, because it does this every time you upload a video and then moves files around.) I would just remove it entirely and just overwrite an existing video file. Invest more time in saving the writable archive instead.
For the perspective videos, I made it so it just "overwrites" an existing 'left' or 'right' file and leaves the "GlossVideo object" alone. (So its link just points to a replaced file.)
Summary of @susanodd : there are some extra things we should think about if we want to do this right.
I would rather go this way:
This would definitely make my life (and for other developers easier) as we are going to upload max 6 videos per glos + 1 animation file.
- History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.
This information is already maintained and is in the admin. We can make this visible via the dataset manager page. Operations related to this would need to be added. About the accidentally deleted videos, that is more difficult. Only the paths are stored in the objects, not the actual video. You can merely see what the operation was. Not undo a delete. That the video files are not stored in the database is evidenced by there not being any videos on the development servers, unless uploaded on them.
That's where "reversion" comes in. Reversion adds extra ".bak" to the end of filenames. And it's ".bak.bak.bak" extensions depending on how many deletes. Or it's version #, in the object model with 0, 1, 2, ... and those all get modified whenever a new video file is uploaded. (The code is inconsistent in doing this. Some of this is automated.) I don't like the code that does this at all. Moreover, if the user keeps retrying to upload a video, it keeps making more of these files...
- Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.
- Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.
Where it is uploaded to is a problem. Objects need to be created. At the moment there is also "reversion" as described above.
There is also a problem with transactions versus file system operations. If the file system hasn't finished writing an upload before the object is saved something goes wrong. If the server gets overloaded with lots of file system operations the transactions will fail. Sqlite locks the entire database for atomic operations.
Django writes to temp storage first. All the operations work this way. (Some other issues are having trouble with this.)
[IMPLEMENTATION]
Maybe it would be possible to split the video upload into two parts, so the "gloss video object" and "gloss video history" objects are created, but only dummy things in the file system, so it can be done fast.
Then couple a cron job that puts all the actual files to replace the dummies. There could be a toggle field in the object to indicate whether it was a dummy or if its contents had been filled in.
In #1329 the bug seems to be with symbolic links.
I fixed this. I rewrote the code to delete first then upload, instead of replace. There is a setting in default.py that turns off deleting video files. This needs to be turned on otherwise the code cannot delete any files even if the video is deleted in the admin. This remains the same for normal videos. I made new branches for NME videos and perspective videos since the code needs to delete the objects and their videos.
Hopefully this issue can remove the reversion code.
A few responses to @rem0g :
Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.
For speed, but I wouldn't mind doing the raw video file instead.
Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.
Can you rephrase this? I'm unsure what you mean exactly here... how is this different from my original proposal?
Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.
I might be mistaken, but I think after the initial upload there are no other processing steps, now that the only check has been removed? That is, it's either uploaded to a gloss, or the upload failed?
History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.
How about an endpoint GET dictionary/{gloss_id}/video_history
, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video
with an id to restore a video uploaded earlier?
A few responses to @rem0g :
Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.
For speed, but I wouldn't mind doing the raw video file instead.
Videos are preprocessed already, but zipping video files slows down and compilates the process. Signbank has to download the provided URL and then unpack it, this can all be skipped by using simple video upload via POST form.
Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.
Can you rephrase this? I'm unsure what you mean exactly here... how is this different from my original proposal? We have this: url = "https://signbank.cls.ru.nl/dictionary/upload_zipped_videos_folder_json and then url = "https://signbank.cls.ru.nl/dictionary/upload_videos_to_glosses/5/"
That is exactly what your proposal is about, but it should be:
url = "https://signbank.cls.ru.nl/dictionary/upload_video/5/" + POST form
then the video is uploaded and binded at once, becasue it retrieves video file in binary + associated post forms.
Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.
I might be mistaken, but I think after the initial upload there are no other processing steps, now that the only check has been removed? That is, it's either uploaded to a gloss, or the upload failed?
SB does post status of other glosses in the response form when i upload entirely different glos video.
History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.
How about an endpoint
GET dictionary/{gloss_id}/video_history
, which returns a list of upload/delete dates for videos and ids to refer to them, andPOST dictionary/{gloss_id}/restore_video
with an id to restore a video uploaded earlier?
Ok
Django uploads files to a temporary storage location before they are put in the proper locations for the dataset/glosses. There is always this step.
Because reversion is being used, Django potentially moves a bunch of video files whenever a new video is uploaded. It also creates objects for each upload A video upload is never just one operation.
@Woseseltops there are two branches/pull requests now with migrations on the underlying video models. Can you checkout this code in conjunction with this issue? #1328, #1338
For restoring videos from the video history a better backup system is needed. We could do the same as for gloss delete/restore and add the archived field. [NOPE, that still depends on the file system underneath.]
My opinion is that reversion needs to be removed. It makes a mess of stuff. Even if we try to program around it, it is still there sometimes. The video history has the operations but there are no videos in it. And if you keep doing operations they keep showing up in the history because it also has unsuccessful (historical) operations stored there. (That is something to do with the file system, because if something failed the objects have already been stored in the database, but perhaps not the file system, and vice verse. Django does not wait on the file system. If there is a solution to this, specific to Django?) (In the ajax calls with videos, the "user html that has no Django in it" needs to "explicitly buffer the file" before it can do the PUSH request. Otherwise the file is empty on the Django side.) That's why I did a zip file with multiple videos in the Manage Media page. Then the user also sees the contents of their zip file and sees the videos being processed. I set a limit to the "process next 10 videos" because the server could not keep up making objects for the videos!!!!! Those video files are already "on" the Signbank file system. It's just video objects that are being created.) We also ought to avoid unnecessary burden of the server with extreme amounts of url requests. (opinion) Avoid thrashing.
There's some confusion around this issue; I'm still not 100% sure @rem0g and I are on the same page, @susanodd isn't sure about traffic... so in an attempt to remove this confusion, I'm going to build a first draft for this and see what everybody thinks.
That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.
That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.
The zip allows the special characters to remain inside the zip because you can give a simple ascii-based filename. Signbank has video files with Chinese characters in them, for example.
In #1341 you have glosses with special characters in them. We are working on locating the bugs.
No, every video should be named with signbank id to prevent issues with gloss renaming.
Gomer
On Fri, Oct 11, 2024 at 06:42 susanodd @.***> wrote:
That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.
The zip allows the special characters to remain inside the zip because you can give a simple ascii-based filename. Signbank has video files with Chinese characters in them, for example.
— Reply to this email directly, view it on GitHub https://github.com/Signbank/Global-signbank/issues/1332#issuecomment-2406543328, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACC4F7XT2CIWPSIPOUF6R3DZ25JKFAVCNFSM6AAAAABO5E6FQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMBWGU2DGMZSHA . You are receiving this because you were mentioned.Message ID: @.***>
No, every video should be named with signbank id to prevent issues with gloss renaming. Gomer …
Yes, of course. But the zip file itself is not the same as the video file inside that includes the annotation in the filename.
The intent of using a zip file is to shield the "information exchange over internet" from the special characters. Some of the (browser) code escapes the urls by putting "%" codes in them. The browsers do this. Inside the zip file, the actual video files can have any name. But in the past, some really old original signbank video code actually used those "%" in the filenames, then it won't match the annotation, obviously. (@Woseseltops wrote a lot of the original code, so hopefully he can investigate.)
Of course, we still need to figure out what is wrong with #1341.
There probably needs to be totally different new code for the API videos. Some of the original code still is trying to convert to mp4. (A function ensure_mp4 still exists. But most of the called functions have commented this out because it was broken. The code to make small videos is also commented out. The code to generate a "still" for the video has also been commented out in some places because there was a bug in it if the "length" of the video ends up changing when it is converted to frames and resized.) All of this code is or is not necessarily called during video upload. @Jetske and @susanodd have introduced new code to try to circumvent some old code. But the old code has not been removed. Most of it is to do with the things mentioned.)
I report here all the buggy things that are messed up with videos. It's not always clear when code is being run or which code is being run.
I agree the video upload needs a complete rewrite, ideally based on signbank ID as video name. This also goes for NMM and Voorbeeldzinnen videos, they all need ID's.
That way we have a better idea of what's going on instead of investigating problems in old codes which needs more people.
Okay I got the external part working, I can now upload video to a Signbank server like this:
# The URL where the video file will be sent
url = 'http://35.159.203.216/dictionary/gloss/12/video'
# The path to your video file
video_file_path = 'glossvideo_ASL_FO_FORCE-834.mov'
# Auth
headers = {'Authorization': f'Bearer {BEARER_TOKEN}'}
# Open the video file in binary mode and send it in a POST request
with open(video_file_path, 'rb') as video_file:
files = {'file': video_file}
response = requests.post(url, files=files, headers=headers)
# Print the server's response
print(response.status_code)
print(response.text)
The internal processing of the file will follow probably next week.
Let me know what I can do to make it more convenient for you @rem0g ! From what you write, I'm thinking perhaps a generic video upload end point, so you're not forced to put the numeric ID of the gloss in the url?
@Woseseltops can you take a look at the other issues about the videos? #1331 #1341 #673
The problem is that the database gets locked from the traffic.
No, every video should be named with signbank id to prevent issues with gloss renaming. Gomer …
I will make this be an option for the existing zip archive so you can use merely the ID as the filename. (The Manage Media page also uses the same format for the zip archive.)
You ought to put like 20 or 50 videos in the zip file. Not just one video. This will reduce traffic to the server.
The code makes a streamed-json result to include all the results for all the videos that are in the same zip file.
Based on the ID, it is already possible to just lookup the gloss to retrieve its default annotation and its lemma (first two characters) in order to construct the path. @ocrasborn came up with using the annotation (without the ID) because normally the (non-API) user is not using the IDs.
I posted many excerpts from the logs in the other issues. You can see that the database is getting locked because of the traffic. My opinion is still the same that the zip archive (with many videos) is better because it's less traffic.
We can modify the existing zip file upload request to put the file inside as a POST instead of as a URL parameter. (I need help with that POST part.)
For applications handling substantial traffic, which translates to multiple workers across various server instances (depending on the scaling strategy adopted), concurrency inevitably needs to be managed.
How does it play out in Django? Well, when you use the inbuilt ORM to call the save() method on an object, there is a good chance that two different instances of the server call this method on the same object at the same time, causing the data to get corrupted.
How can we resolve this? Quite simple. A pessimistic approach dictates that you should lock the resource in question exclusively until you are finished with it.
When a database operation is in progress, the object or the set of updated objects must be locked until the operation is complete so that no other process can access this object. This will prevent multiple server instances from loading stale data into memory and corrupting the database.
Okay I got the external part working, I can now upload video to a Signbank server like this:
# The URL where the video file will be sent url = 'http://35.159.203.216/dictionary/gloss/12/video' # The path to your video file video_file_path = 'glossvideo_ASL_FO_FORCE-834.mov' # Auth headers = {'Authorization': f'Bearer {BEARER_TOKEN}'} # Open the video file in binary mode and send it in a POST request with open(video_file_path, 'rb') as video_file: files = {'file': video_file} response = requests.post(url, files=files, headers=headers) # Print the server's response print(response.status_code) print(response.text)
The internal processing of the file will follow probably next week.
Let me know what I can do to make it more convenient for you @rem0g ! From what you write, I'm thinking perhaps a generic video upload end point, so you're not forced to put the numeric ID of the gloss in the url?
Yes perfect, will there also be callback if the video upload has succeed and processed in the database directly from the URL? If that will cause a long wait time I can adjust my scripts.
@rem0g I added api creation with video upload of voorbeeldzinnen here: https://signbank.github.io/Global-signbank/#/Adding%20Signbank%20data/post_dictionary_api_create_annotated_sentence__datasetid__ Can you see if this way of uploading files works for you? I think it would be good to at least handle all kinds of api video uploads in a consistent way.
What @Woseseltops proposes is that the API uses a file name is the path to the file destination in the Signbank file system, using underscore where slash is.
@rem0g suggested (in #1331) we redesign the video upload. I think the current workflow is:
- Name video files after glosses, store them in a zip
- Upload a zip file to a server somewhere and grab its url
- Call
/dictionary/upload_zipped_videos_folder_json/{datasetid}/
and provide this url- Call
/dictionary/upload_videos_to_glosses/{datasetid}
to actually link them to the glosses.I want to replace this with just one step, to be run for each gloss invidually:
- POST
/dictionary/gloss/{gloss_id}/zippedvideo
with just the zipped file in the payload.I don't think uploading the videos one by one has any real drawbacks in terms of overload, and it does make the whole process easier to follow: you upload a video to a gloss, and if it fails somewhere in the process, it can just return what went wrong for that one gloss.
Also, @rem0g suggests to remove the h264 check:
it's not needed to check whether it's h264. If it doesnt play, it wont play and we will know that one way or another.`
@Woseseltops the zip file contains an archive of many videos. Not just one video. It was intended to reduce traffic. For example, if you upload a zip file with 20 videos, it only needs two url requests, not 40.
I am working on making the zip archive inside the POST instead of as a URL. The original method gets the zip file from a URL because at the time, I had no idea how to put a zip file inside the POST. Tim helped me do it as a URL because you were on vacation then.
Response to @rem0g :
Yes perfect
I'm not sure if you're responding to my Python example or my (alternative) proposal below it, but to make sure I don't slow things down even further I have finished the end point as in the example and put it up for review. Let me know if you're also interested in having a generic endpoint that doesn't force you to put the gloss ID in the URL!
will there also be callback if the video upload has succeed and processed in the database directly from the URL?
This is example output from using the script above, does that answer your question?
400
{"error": "No file uploaded"}
200
{"message": "Uploaded video of size 156919 bytes to dataset NGT."}
If that will cause a long wait time I can adjust my scripts.
In my test cases it is relatively quick! There are worries on our side that adding this end point might generate a lot of parallel requests, which will lock our database (which was not designed for mass usage). We could implement rate limiting, but instead it's probably easier to ask you to put a few seconds between each request if possible :)
Yes perfect. I will take a look at it next week.
On Thu, Oct 17, 2024 at 14:35 Wessel Stoop @.***> wrote:
Response to @rem0g https://github.com/rem0g :
Yes perfect
I'm not sure if you're responding to my Python example or my (alternative) proposal below it, but to make sure I don't slow things down even further I have finished the end point as in the example and put it up for review. Let me know if you're also interested in having a generic endpoint that doesn't force you to put the gloss ID in the URL!
will there also be callback if the video upload has succeed and processed in the database directly from the URL?
This is example output from using the script above, does that answer your question?
400 {"error": "No file uploaded"}
200 {"message": "Uploaded video of size 156919 bytes to dataset NGT."}
If that will cause a long wait time I can adjust my scripts.
In my test cases it is relatively quick! There are worries on our side that adding this end point might generate a lot of parallel requests, which will lock our database (which was not designed for mass usage). We could implement rate limiting, but instead it's probably easier to ask you to put a few seconds between each request if possible :)
— Reply to this email directly, view it on GitHub https://github.com/Signbank/Global-signbank/issues/1332#issuecomment-2419423187, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACC4F7UIRIQLJPRKGIVGU2DZ36VIZAVCNFSM6AAAAABO5E6FQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMJZGQZDGMJYG4 . You are receiving this because you were mentioned.Message ID: @.***>
Okay I think the only remaining item here is this idea
How about an endpoint GET dictionary/{gloss_id}/video_history, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video with an id to restore a video uploaded earlier?
I'm not sure what we have in terms of video history at all, so perhaps before I start: @rem0g , given approaching deadlines, do you still think it would make sense for me to work on this?
@rem0g suggested (in #1331) we redesign the video upload. I think the current workflow is:
/dictionary/upload_zipped_videos_folder_json/{datasetid}/
and provide this url/dictionary/upload_videos_to_glosses/{datasetid}
to actually link them to the glosses.I want to replace this with just one step, to be run for each gloss invidually:
/dictionary/gloss/{gloss_id}/zippedvideo
with just the zipped file in the payload.I don't think uploading the videos one by one has any real drawbacks in terms of overload, and it does make the whole process easier to follow: you upload a video to a gloss, and if it fails somewhere in the process, it can just return what went wrong for that one gloss.
Also, @rem0g suggests to remove the h264 check: