ChristopherLucas closed this issue 9 years ago
Do you mean you want to avoid fetching transcription results for audio files that have already been transcribed, and you want to avoid sending files that already have job IDs?
It is possible to return Null (empty string, "") rather than the 'status'. Perhaps a timestamp indicating when the function should be called again?
The current function just tries job IDs that were "queuing" at the time of the last call, so that actually works well. The fragile part is just the reliance on the CSV.
Probably the best way to do it is to just make retrieveText() smarter as it reads the CSV (e.g., use more columns and have it subset to the rows it needs based on the existing columns or something).
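A minimal sketch of what "smarter" could look like, assuming the table has a TRANSCRIPT column that holds either the text, NA, or a status such as "queued". The column names and the status string are assumptions for illustration, not the package's actual schema.

```r
# Hypothetical column names and status value; the real transcribeR CSV may differ.
jobs <- data.frame(
  FILENAME = c("a.wav", "b.wav", "c.wav"),
  JOBID = c("101", "102", "103"),
  TRANSCRIPT = c("hello world", "queued", NA),
  stringsAsFactors = FALSE
)

# Return the indices of rows that still need a GET request.
pending_rows <- function(jobs, pending_status = "queued") {
  if (!"TRANSCRIPT" %in% names(jobs)) {
    # CSV came straight from sendAudio(): nothing has been retrieved yet.
    return(seq_len(nrow(jobs)))
  }
  which(is.na(jobs$TRANSCRIPT) | jobs$TRANSCRIPT %in% pending_status)
}

pending_rows(jobs)
```

retrieveText() would then only request the jobs at those indices and leave the finished rows untouched.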
Oh, by the way, I was wrong: the effect on performance will be barely noticeable even if you have to read and write the CSV again.
Interesting, thanks. Still, you were right about the design of the dataframe/table. Whatever the type, we should just create it with all eventual columns (including transcribed.text). That solves the second task in this issue and makes passing CSVs between sendAudio() and retrieveText() trivial. Just do whatever is simplest here.
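Roughly, creating the table with every eventual column up front might look like the sketch below. The column names (DATE, FILENAME, JOBID, TRANSCRIPT) and the helper new_job_table() are hypothetical, chosen only to illustrate the idea.

```r
# Hypothetical schema shared by sendAudio() and retrieveText().
TRANSCRIBER_COLS <- c("DATE", "FILENAME", "JOBID", "TRANSCRIPT")

# Build the table once, with the transcript column already present (all NA).
new_job_table <- function(filenames, job_ids) {
  stopifnot(length(filenames) == length(job_ids))
  data.frame(
    DATE = rep(as.character(Sys.Date()), length(filenames)),
    FILENAME = filenames,
    JOBID = job_ids,
    TRANSCRIPT = rep(NA_character_, length(filenames)),
    stringsAsFactors = FALSE
  )
}

tbl <- new_job_table(c("a.wav", "b.wav"), c("101", "102"))
```

Because both functions see the same columns at every stage, neither has to guess which stage of the workflow produced the file.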
As we discussed, the first task in this issue can be solved by storing full filenames and then grep'ing the new files to determine which should be skipped.
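A rough sketch of the skip logic, assuming the CSV stores full file paths in a column called FILENAME (a hypothetical name) and that the switch is exposed as a reupload argument. It uses exact matching with %in% rather than grep(), since grep() treats its pattern as a regular expression by default and the dots in file names would match any character.

```r
# files:    full paths found in the audio directory
# existing: data frame read from a previous CSV, or NULL on the first run
files_to_upload <- function(files, existing = NULL, reupload = FALSE) {
  if (reupload || is.null(existing) || nrow(existing) == 0) {
    return(files)
  }
  # Skip anything whose full path is already recorded.
  files[!files %in% existing$FILENAME]
}

existing <- data.frame(FILENAME = "/data/a.wav", JOBID = "101",
                       stringsAsFactors = FALSE)
files_to_upload(c("/data/a.wav", "/data/b.wav"), existing)
```

With reupload = TRUE the function returns every file, which keeps the old behaviour available on request.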
At present, sendAudio() writes out a csv that contains the job IDs. You can add new jobs by passing the CSV back through sendAudio.
1) If given the same directory, sendAudio will reupload the same files but not with new job IDs. Let's add an argument to sendAudio() so that, by default, files with the same file name are not reuploaded. Something like reupload = FALSE.
Then, the CSV written by sendAudio() goes into retrieveText(), which creates a data frame with a new column. That dataframe can go back into sendAudio() (in the event that some jobs aren't finished, it will fill them in without re-GETing all other jobs). However, now it has a new column, and when written, it might not be able to go back into sendAudio. This is because this line in sendAudio isn't very good.
So...
2) We need a smarter way to identify a transcribeR csv, which will identify a csv and appropriately append to it at any stage in the workflow (first time through sendAudio, second time through, back to sendAudio after retrieveText, etc).
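One possible approach to item 2, sketched under assumptions: treat a file as a transcribeR CSV if it has a small set of required columns, and fill in any optional columns that are missing so the table has the same shape at every stage. The column names and both helper functions are hypothetical.

```r
# Hypothetical schema; the real column names may differ.
REQUIRED_COLS <- c("FILENAME", "JOBID")
OPTIONAL_COLS <- c("DATE", "TRANSCRIPT")

is_transcriber_table <- function(df) {
  is.data.frame(df) && all(REQUIRED_COLS %in% names(df))
}

# Read a CSV from any stage of the workflow and normalise its columns.
read_transcriber_csv <- function(path) {
  if (!file.exists(path)) return(NULL)  # first run: nothing to append to
  df <- read.csv(path, stringsAsFactors = FALSE, colClasses = "character")
  if (!is_transcriber_table(df)) {
    stop("Not a transcribeR CSV; missing column(s): ",
         paste(setdiff(REQUIRED_COLS, names(df)), collapse = ", "))
  }
  # Add any optional columns the file does not have yet.
  for (col in setdiff(OPTIONAL_COLS, names(df))) {
    df[[col]] <- rep(NA_character_, nrow(df))
  }
  df
}
```

Both sendAudio() and retrieveText() could call the same reader, so a file written after retrieveText() is accepted by sendAudio() without special handling.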