chbrown / overdrive

Bash script to download mp3s from the OverDrive audiobook service
452 stars 66 forks source link

Error: "Output already exists" during download, plus spurious chars in author and title #23

Closed gtTracy closed 4 years ago

gtTracy commented 4 years ago

Greetings,

Environment: Ubuntu 20.04 Script version: Dated 4/29/2020 (tried downloading latest copy, same problem exists)

(Edited issue to resolve formatting problem with metadata excerpt)

I don't know if this is a problem with Overdrive itself, or if something changed in one of the prerequisite tools, but I am having the following problem when downloading items from Overdrive (note that this occurs with multiple different items, I am using this one merely as an example - also note that client and license information has been removed, as there doesn't appear to be a problem with acquiring and using the license):

tracy@tracy-HP:/media/tracy/Refurb/LibAB$ overdrive download 'Buying Time.odm'
License already acquired: Buying Time.odm.license
Using License=<license info removed></License>
Using ClientID=<removed> from License
Using Author=Joe Haldeman, 
Using Title=Buying Time-
Creating directory Joe Haldeman,  - Buying Time-
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part01.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part01.mp3 successfully
Downloading Joe Haldeman,  - Buying Time-/Buying Time--
Downloaded Joe Haldeman,  - Buying Time-/Buying Time-- successfully
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part02.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part02.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part03.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part03.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part04.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part04.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part05.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part05.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part06.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part06.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part07.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part07.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part08.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part08.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Downloading Joe Haldeman,  - Buying Time-/Buying Time--Part09.mp3
Downloaded Joe Haldeman,  - Buying Time-/Buying Time--Part09.mp3 successfully
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--
Using CoverUrl=https://images.contentreserve.com/ImageType-100/6578-1/%7BCD9D8596-387E-4F84-A4A3-8734E1F6D941%7DImg100.jpg
Downloading Joe Haldeman,  - Buying Time-/folder.jpg
Downloaded cover image successfully
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ 

Note on the Title and Author lines the spurious characters after the names (the comma after the author, and the dash after the title). Also note the repeated "Output already exists" lines (after each download except the first).

I have successfully used the script in the past (back in early May), but at that time I was running Ubuntu 19.10 (I did a clean install when I moved to 20.04, so it is possible that some underlying dependency changed, or even is missing).

Here are what I think are the relevant lines from the odm files (I can provide the entire odm file, if that will be helpful):

    <![CDATA[<Metadata>
    <ContentType>MP3 Audio Book</ContentType>
    <Title>Buying Time</Title>
    <SortTitle>Buying Time</SortTitle>
    <Publisher>Novel Audio</Publisher>
    <ThumbnailUrl>https://images.contentreserve.com/ImageType-200/6578-1/{CD9D8596-387E-4F84-A4A3-8734E1F6D941}Img200.jpg</ThumbnailUrl>
    <CoverUrl>https://images.contentreserve.com/ImageType-100/6578-1/{CD9D8596-387E-4F84-A4A3-8734E1F6D941}Img100.jpg</CoverUrl>
    <Creators>
        <Creator role="Author" file-as="Haldeman, Joe">Joe Haldeman</Creator>
        <Creator role="Narrator" file-as="Vale, Eric">Eric Vale</Creator>
    </Creators>

And here is how the same data appears in the metadata file:

<Metadata>
<ContentType>MP3 Audio Book</ContentType>
<Title>Buying Time</Title>
<SortTitle>Buying Time</SortTitle>
<Publisher>Novel Audio</Publisher>
<ThumbnailUrl>https://images.contentreserve.com/ImageType-200/6578-1/{CD9D8596-387E-4F84-A4A3-8734E1F6D941}Img200.jpg</ThumbnailUrl>
<CoverUrl>https://images.contentreserve.com/ImageType-100/6578-1/{CD9D8596-387E-4F84-A4A3-8734E1F6D941}Img100.jpg</CoverUrl>
<Creators>
<Creator role="Author" file-as="Haldeman, Joe">Joe Haldeman</Creator>
<Creator role="Narrator" file-as="Vale, Eric">Eric Vale</Creator>
</Creators>

Any ideas what might be going wrong?

chbrown commented 4 years ago

Weird. No, can't tell what's going wrong from that. Harder to say without using latest version's output.

Maybe try removing the space from the filename, and correspondingly the license file since it won't let you generate two license for one loan. Delete the .odm.metadata file and retry.

gtTracy commented 4 years ago

I have tried other odm files that didn't have spaces in their names (the only reason that one did is because it was going to be saved as "multipart.odm", so I changed the name by typing a new one when it was saved). But others, such as "AuberonTheExpanseSeriesBook85-5795.odm" also produce the same type of spurious output.

I had backed down to the older version of the script originally to see if that fixed the problem. Since it didn't, I will pull a new copy of the script this evening, and run it against one of the files that has no space in the name. Do you want the regular output, or the --verbose output?

gtTracy commented 4 years ago

Edit note: again cleaning up formatting for the command output, and the odm and odm.metadata excerpts... Ugh. I can't get the formatting right for the post. So I'm going to attach a text file with all the output, and the file excerpts...

OK, so back using the newest version of the script (pulled using the instructions on the main page). I decided to try a slightly different test, since I believe the spurious characters in the author and title might be causing the other problem. So here is output showing that the spurious characters are still being generated: script output.txt

tracy@tracy-HP:/media/tracy/Refurb/LibAB$ overdrive --version
2.0.0
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ overdrive info AuberonTheExpanseSeriesBook85-5795.odm
author  title   duration
James S. A. Corey,  Auberon-    8821
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ overdrive info AuberonTheExpanseSeriesBook85-5795.odm --verbose
Entering debug (verbose) mode
+ CURLOPTS=("${CURLOPTS[@]:1}")
+ shift
+ [[ 0 -gt 0 ]]
+ [[ 1 -eq 0 ]]
+ [[ 1 -eq 0 ]]
+ HEADER_PRINTED=
+ for ODM in "${MEDIA[@]}"
+ for COMMAND in "${COMMANDS[@]}"
+ case $COMMAND in
+ info AuberonTheExpanseSeriesBook85-5795.odm
+ [[ -z '' ]]
+ printf '%s\t%s\t%s\n' author title duration
author  title   duration
+ HEADER_PRINTED=1
+ metadata_path=AuberonTheExpanseSeriesBook85-5795.odm.metadata
+ extract_metadata AuberonTheExpanseSeriesBook85-5795.odm AuberonTheExpanseSeriesBook85-5795.odm.metadata
+ [[ -e AuberonTheExpanseSeriesBook85-5795.odm.metadata ]]
+ xmllint --noblanks --xpath '/OverDriveMedia/text()' AuberonTheExpanseSeriesBook85-5795.odm
+ sed -e '1s/^<!\[CDATA\[//' -e '$s/]]>$//'
+ tidy -xml -wrap 0 -quiet
++ extract_author AuberonTheExpanseSeriesBook85-5795.odm.metadata
++ _xmllint_iter_xpath '//Creator[starts-with(@role, '\''Author'\'')][position()<=3]' AuberonTheExpanseSeriesBook85-5795.odm.metadata
++ sed '$!s/$/, /'
++ tr -d '\n'
+++ xmllint --xpath 'count(//Creator[starts-with(@role, '\''Author'\'')][position()<=3])' AuberonTheExpanseSeriesBook85-5795.odm.metadata
++ count=1
+++ seq 1 1
++ for i in $(seq 1 "$count")
++ xmllint --xpath 'string(//Creator[starts-with(@role, '\''Author'\'')][position()<=3][position()=1])' AuberonTheExpanseSeriesBook85-5795.odm.metadata
++ printf '\n'
++ extract_title AuberonTheExpanseSeriesBook85-5795.odm.metadata
++ xmllint --xpath '//Title/text()' AuberonTheExpanseSeriesBook85-5795.odm.metadata
++ tr -Cs '[:alnum:] ._-' -
++ extract_duration AuberonTheExpanseSeriesBook85-5795.odm
++ _xmllint_iter_xpath //Part AuberonTheExpanseSeriesBook85-5795.odm /@duration
++ awk -F : '{sum += $1*60 + $2} END {print sum}'
+++ xmllint --xpath 'count(//Part)' AuberonTheExpanseSeriesBook85-5795.odm
++ count=2
+++ seq 1 2
++ for i in $(seq 1 "$count")
++ xmllint --xpath 'string(//Part[position()=1]/@duration)' AuberonTheExpanseSeriesBook85-5795.odm
++ printf '\n'
++ for i in $(seq 1 "$count")
++ xmllint --xpath 'string(//Part[position()=2]/@duration)' AuberonTheExpanseSeriesBook85-5795.odm
++ printf '\n'
+ printf '%s\t%s\t%d\n' 'James S. A. Corey, ' Auberon- 8821
James S. A. Corey,  Auberon-    8821
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ 

I included the output from the regular "info" command, and the --verbose output from the same command. Below are the relevant portions of the odm file, and the odm.metadata file (again, license information removed):

.odm

<?xml version="1.0"?>
<OverDriveMedia id="8b6c121c-f7d4-4b47-a509-3faae7236353-425" ODMVersion="3.0.0.0" OMCVersion="3.0.0.0">
    <License>
    </License><![CDATA[<Metadata>
    <ContentType>MP3 Audio Book</ContentType>
    <Title>Auberon</Title>
    <SubTitle>The Expanse Series, Book 8.5</SubTitle>
    <SortTitle>Auberon The Expanse Series Book 85</SortTitle>
    <Series>The Expanse</Series>
    <Publisher>Hachette Audio</Publisher>
    <ThumbnailUrl>https://images.contentreserve.com/ImageType-200/4575-1/{8B6C121C-F7D4-4B47-A509-3FAAE7236353}Img200.jpg</ThumbnailUrl>
    <CoverUrl>https://images.contentreserve.com/ImageType-100/4575-1/{8B6C121C-F7D4-4B47-A509-3FAAE7236353}Img100.jpg</CoverUrl>
    <Creators>
        <Creator role="Author" file-as="Corey, James S. A.">James S. A. Corey</Creator>
        <Creator role="Narrator" file-as="Mays, Jefferson">Jefferson Mays</Creator>
    </Creators>
    <Subjects>
        <Subject id="26">Fiction</Subject>
        <Subject id="80">Science Fiction</Subject>
    </Subjects>
    <Languages>
        <Language code="en">English</Language>
    </Languages>

odm.metadata:

<Metadata>
<ContentType>MP3 Audio Book</ContentType>
<Title>Auberon</Title>
<SubTitle>The Expanse Series, Book 8.5</SubTitle>
<SortTitle>Auberon The Expanse Series Book 85</SortTitle>
<Series>The Expanse</Series>
<Publisher>Hachette Audio</Publisher>
<ThumbnailUrl>https://images.contentreserve.com/ImageType-200/4575-1/{8B6C121C-F7D4-4B47-A509-3FAAE7236353}Img200.jpg</ThumbnailUrl>
<CoverUrl>https://images.contentreserve.com/ImageType-100/4575-1/{8B6C121C-F7D4-4B47-A509-3FAAE7236353}Img100.jpg</CoverUrl>
<Creators>
<Creator role="Author" file-as="Corey, James S. A.">James S. A. Corey</Creator>
<Creator role="Narrator" file-as="Mays, Jefferson">Jefferson Mays</Creator>
</Creators>
<Subjects>
<Subject id="26">Fiction</Subject>
<Subject id="80">Science Fiction</Subject>
</Subjects>
<Languages>
<Language code="en">English</Language>
</Languages>

My original thought was that the script was picking up the "file-as" clause (instead of the text name for the Creator tag) in the metadata, then unwrapping the name and not removing the trailing comma. But that doesn't explain the spurious - at the end of the title. So, I am at a loss.

I suspect if we track this down, it will end up being something silly, like a missing prerequisite, or a prerequisite that has changed behavior. I just don't know enough about the way the script is structured to be able to put a finger on it. I would guess that it is not the tidy command, because the odm.metadata file doesn't have any spurious characters. But the rest...? I dunno. My intuition is pointing at xmllint, but I have nothing at all to back that up....

I am putting the odm file along with the license file into a separate directory for safe keeping (so they don't accidentally get changed or deleted). So I will be using the above referenced file (AuberonTheExpanseSeriesBook85-5795.odm) for any further testing - unless you want me to try some other files. Again, worth noting that this is happening with each and every file I have tried today (about 15 different files, some names with spaces, some names without, etc)...

Ideas?

gtTracy commented 4 years ago

For what it's worth, here are the various prerequisite commands and their version numbers as installed on my system (note that openssl --version didn't return any useful output, but since that's only used in getting the license file, and we are successfully getting that, I didn't worry about that one)...

tracy@tracy-HP:/media/tracy/Refurb/LibAB$ curl --version
curl 7.68.0 (x86_64-pc-linux-gnu) libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Release-Date: 2020-01-08
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS brotli GSS-API HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ uuidgen --version
uuidgen from util-linux 2.34
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ xmllint --version
xmllint: using libxml version 20910
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ICU ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma 
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ iconv --version
iconv (Ubuntu GLIBC 2.31-0ubuntu9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ openssl --version
Invalid command '--version'; type "help" for a list.
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ base64 --version
base64 (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Simon Josefsson.
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ tidy --version
HTML Tidy for Linux version 5.6.0
tracy@tracy-HP:/media/tracy/Refurb/LibAB$ 
chbrown commented 4 years ago

Just tested Auberon from my library, which has the same filename, and downloads on my machine without a hitch (or extra comma).

I agree that it's almost certain it'll end up being something silly. Comparing my --verbose output to yours, I'm suspecting some dumb whitespace interaction between xmllint and sed :( IDK though. Locally I'm using sed that is macOS's built-in BSD May 10, 2005 version. GNU sed seems to work just as well though.

It's downloading the files, though, right? Just trying to download the directory too for some reason and realizing it already exists? Not exactly not a bug, but not totally breaking, either...

gtTracy commented 4 years ago

Oh, definitely not broken, just not "elegant". I mean, if it's something to live with, then so be it. But I sure would like to find out why... I guess I'm just crazy that way :-)

By the way... What is the best way to submit to you potential changes to the script? I have been toying around with it, trying to add a couple of features (support for the Narrator field, if provided in the .odm, and the ability to split out the folder structure to Author/Title[/Narrator] rather than Author - Title)? Bear in mind that I know almost nothing about submitting code changes or managing version control, so anything more complicated than (in essence) sending you the modified shell script will likely require some hand-holding...

chbrown commented 4 years ago

Inelegant, certainly. I'm all for seamless cross-platform consistency, so if you figure it out, and whatever change works the same here in macOS/BSD land, I'd be happy to merge it :)

GitHub's web interface makes it pretty simple to turn a modified file into a diff / commit that you can then submit as a PR.

That said, I'm wary of adding too much flexibility to this script. If your change / PR is something like "This adds a CLI flag --atn to use the file structure Author/Title/Narrator/PartNN.mp3 instead of the de facto Author - Title/PartNN.mp3 hierarchy" I would probably reject it. No hard feelings, just — that's outside the scope of this tool. I hold myself to the same sentiment — I regularly use this script in conjunction with https://github.com/chbrown/booktool — I created both, and it'd be simpler for me if I merged them, but potentially less usable by other people.

gtTracy commented 4 years ago

No worries at all. I was (am) going to implement something along those lines (for personal use - I can submit it if you want, or not if you don't, no problem either way), but I am also going to see if I can track down a method of handling the spurious characters as well. I will do these as separate projects (in whatever free time I can devote to them), so that one is not dependent on the other being accepted.

Do you have any particular preferences in how the solutions may be implemented? I know some people have feelings (one way or the other) about having a script launch sub-scripts, and some people want the most compact way while others want everything spelled out in detail... If you have preferences, I will keep them in mind - otherwise I will just see what I can do and you can decide if it's worth adding :-)

gtTracy commented 4 years ago

Two things I came across, and I'm not sure if they are the actual problems, or if they are just working because I have only tried a couple of odm files. In the extract_title function, the line:

  | tr -Cs '[:alnum:] ._-' -

appears to have a spurious - at the very end of the line. Not sure if that's there for a specific reason, but removing it removes the problem with the title having a - after it.

Meanwhile, in the extract_author function, I find this line:

  _xmllint_iter_xpath "//Creator[starts-with(@role, 'Author')][position()<=3]" "$1" \

I am uncertain of the significance of the underscores in the _xmllint_iter_xpath call - not sure what they do or why they are there. If I replace that line with this line:

  xmllint --xpath "//Creator[starts-with(@role, 'Author')]/text()" "$1" \

It seems to work as expected, with no spurious commas in the output of the author names.

I am only just starting to look - there are some other things I want to look into (replacing the sed calls with built-in bash string processing, to help potentially speed things up), as well as possibly eliminating some of the subshell calls, where possible (again, speed related things). But I thought I would pass this on to you and see if any of it made sense to you, or if I'm barking up the wrong tree with these things...

chbrown commented 4 years ago

In the extract_title function, the (intended) behavior of the tr call is basically:

So the final - is not spurious (unless I'm misunderstanding something, or you have a different tr).

Test example: find a book with a colon in the title. The resulting title as used by download or printed by info should not contain a colon.


_xmllint_iter_xpath is a function defined above, which serves as a hack to handle cases where the QUERY in xmllint --xpath QUERY file.xml has (potentially) multiple matches. I could not otherwise figure out how to get xmllint to distinguish (separate) the values of multiple results with any kind of separator character, much less a custom character. Believe me, I tried 😖 ... but hey, if you figure out a way, let me know! 🙏

Test example: find a book with multiple authors (where the Creator tag's role attribute is "Author" or "Author and narrator"). The resulting author as used by download or printed by info should consist of (up to 3 of) the authors, separated by commas.

gtTracy commented 4 years ago

Sorry about the stupid question about the definition for _xmllint_iter_path. I completely missed it when reading through the script. I found an Overdrive book with more than 4 authors, and my modified script definitely puts them all in there. Working on a possible solution to that at this time.

And, you are quite correct, the - on the tr command is not spurious for the use case defined. I found a couple of book titles that have non-alphanumerics (colons, commas, etc) in the titles, and the function does replace those correctly (even if the trailing character is still getting caught - which I am beginning to suspect is a trailing newline (\n) character which is getting converted). I have an idea on how to approach that, but I haven't quite gotten it working yet.

I have no experience on Mac platforms, so I'm not sure if there are limitations on the bash implementation there (as opposed to Ubuntu, which is where I am at the moment), so please excuse me if this is a stupid question, but... Does bash on the Mac platform support arrays? One of the possible solutions I am working toward uses them (but I have an alternate that doesn't, so if they aren't supported - or the support for them isn't widespread enough to justify them - I can work with that). I was doing some reading of various things on the net, and there was a suggestion that at least some POSIX platforms don't support arrays - wasn't sure if that applied in this case, or not.

gtTracy commented 4 years ago

Not elegant code (yet - still working on it), but here's a first cut - doesn't have the spurious comma at the end, at least...

extract_author_limit(){
  # Usage extract_author_limit book.odm.metadata count_to_return
  # This method provides a bit of flexibility on how many authors to return
  # You can return as few or as many as you wish.
  # The plan, eventually, is to make the "role" name a parameter, so that
  # it is possible to return authors, narrators, editors, etc.
  (( count=0 ))
  (( stop=$2 ))
  authors=""

  while read; 
  do
    auth=$REPLY
    if (( ++count <= stop )); then
      if [ -z "$authors" ]; then
        authors+=$auth
      else
        authors+=", "$auth
      fi
    fi
  done < <(xmllint --xpath "//Creator[starts-with(@role, 'Author')]/text()" "$1")

  echo "$authors"
}

# Modified extract_author - testing various theories and processes
#
extract_author() {
  # Usage: extract_author book.odm.metadata
  # Most Creator/@role values for authors are simply "Author" but some are "Author and narrator"
  extract_author_limit $1 3
}
chbrown commented 4 years ago

I bet you're right about the trailing newlines being the culprit. On your Linux distro, your tools follow proper UNIX etiquette of ensuring that text processing commands emit output with a trailing newline. On my macOS, these tools are a disjointed mixture of GNU and BSD, which for the most part are interchangeable, but with enough differences that weird edge cases are bound to crop up eventually, as you've noticed 😖

Reminds me of some previous issues/PRs that you might find illuminating:

P.S.: Arrays are totally fine. In fact, the script currently uses them pervasively — the MEDIA, COMMANDS, and CURLOPTS variables are all arrays :) I've never encountered a bash that didn't support arrays, but FWIW, macOS Mojave comes with GNU bash, version 3.2.57 and I use GNU bash, version 5.0.18 installed via Homebrew. IDK what bash 4 -> 5 changed, but I'd say bash 4 is a decently well-established common denominator.

gtTracy commented 4 years ago

First cut at getting rid of the trailing -. Unless you can think of a case not covered here, I don't see much more to do to this one.

# Modified extract_title - testing various theories and processes
#
extract_title() {
  # Usage: extract_title book.odm.metadata
  Title=$(xmllint --xpath '//Title/text()' "$1")
# :word: expands to letters, digits and underscore, :blank: is space
  echo ${Title//[^[:word:][:blank:].]/-}
}
gtTracy commented 4 years ago

Well, I think I have the changes that I would submit pared down to the reasonable minimum, but I can't seem to figure out how to contribute. Like I said earlier, I don't know much (well, OK, at this point, I'm willing to settle for "don't know anything") about versioning and code submissions - and I'm not having much luck at figuring it out. I'm probably missing something blindingly obvious, but...

So, anyway, here are the two changes that I would submit - I have listed the full functions that would replace the existing functions in the script:

extract_author() {
  # Usage: extract_author book.odm.metadata
  # Most Creator/@role values for authors are simply "Author" but some are "Author and narrator"
  (( count=0 ))
  (( stop=3 )) # This can be customized to allow more, or less, entries returned
  authors=""

  while read -r; 
  do
    auth="$REPLY"
    if (( ++count <= stop )); then
      authors="$authors"${authors:+", "}"$auth"
    fi
  done < <(xmllint --xpath "//Creator[starts-with(@role, 'Author')]/text()" "$1")

  echo "$authors"
}

extract_title() {
  # Usage: extract_title book.odm.metadata
  Title=$(xmllint --xpath '//Title/text()' "$1")
  # :word: expands to letters, digits and underscore, :blank: is space
  echo ${Title//[^[:word:][:blank:].]/-}
}

If they work for you, and you don't find any issues with them, they are yours for the taking. My goals were twofold:

1) Resolve the issues that brought me here 2) Do this without bringing in any external dependencies (ie. make them as close to 100% bash as I could without breaking anything else)

Hopefully they will be of use.

Meanwhile, I'm getting ready to tear the whole script apart and make a bunch of changes (for local consumption only - things that I want that aren't currently there, or things that I want done differently). When I get done, it will probably be quite unrecognizable, and if it works it will be a miracle (especially considering this is actually my first stab at doing anything related to bash scripting)

chbrown commented 4 years ago

(git / version control are anything but blindingly obvious 😉   but in any case no need to worry about versioning / code submissions right now)

Looking at your code in the last comment... I didn't try it out locally, but I don't see how it addresses the problem that _xmllint_iter_xpath solves, i.e., xmllint doesn't separate output values when the given --xpath query matches multiple nodes. Perhaps you have a substantially different xmllint, because for me:

$ xmllint --xpath '//a/text()' <(echo '<root><a>1</a><a>2</a><b>3</b></root>')
12

(and there's no newline following the 12)

Btw, on my machine:

$ xmllint --version
xmllint: using libxml version 20904
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude ICU ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib

...which is older, but IDK what sort of versioning scheme allows changing that sort of fundamental behavior between "20904" and "20910" 😠

Also, I didn't realize (/TIL!) you could use regex character classes in bash string substitutions. Cool!


Anyhow, taking a closer look at this, I think I came up with a more... economical solution. I made two smallish changes, e9b294c and e6744a5 (the latter is kind of cheating, but overall simplifies things 😄 ). Can you try out the version on the "spurious-newlines" branch and see if it fixes your original issue? (That version will print out 2.1.0 when you call overdrive --version.)

gtTracy commented 4 years ago

So.... I tried your updated script (from the "spurious-newlines" branch), and it works, after a fashion. I only got one author listed for a file that definitely has more than one. But I did not get a spurious comma after the author, nor a spurious dash after the title. So, we have a partial win.

I created a very simple test harness script, to test out the two functions (actually, three, because I added an extract_all_author function that doesn't limit the number of returned authors), and ran that script against the same data files as I used to test your updated script. Here's the output from everything, along with xmllint version info:

racy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ which xmllint
/usr/bin/xmllint
tracy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ /usr/bin/xmllint --version
/usr/bin/xmllint: using libxml version 20910
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ICU ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma 
tracy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ ~/ScriptTest/od2.1.0 info TorcomCollection.odm
author  title   duration
Kai Ashante Wilson  Tor.com Collection  129599
tracy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ ~/ScriptTest/harness TorcomCollection.odm.metadata
Metadata file:  TorcomCollection.odm.metadata
Title:  Tor.com Collection
Author list:  Kai Ashante Wilson, Paul Cornell, Alter S. Reiss

All authors:  Kai Ashante Wilson, Paul Cornell, Alter S. Reiss, Nnedi Okorafor, K. J. Parker, Angela Slatter, Matt Wallace, Daniel Polansky, Sylvia Spruck Wrigley, Michael R. Underwood

Here's is the test harness script, if you want to grab it and give it a whirl. Note that you have to pass the book.odm.metadata file name as the argument (not just the book.odm) because I stripped out everything else for testing purposes. But if you've run your script against the book.odm file, the book.odm.metadata file should already be there to test against.

#!/usr/bin/env bash

extract_all_author() {
  # Usage: extract_author book.odm.metadata
  # Most Creator/@role values for authors are simply "Author" but some are "Author and narrator"
  authors=""

  while read -r; 
  do
    auth="$REPLY"
    authors="$authors"${authors:+", "}"$auth"
  done < <(xmllint --xpath "//Creator[starts-with(@role, 'Author')]/text()" "$1")

  echo "$authors"
}

extract_author() {
  #Usage: extract_author book.odm.metadata
  (( count=0 ))
  (( stop=3 ))
  authors=""

  while read -r; 
  do
    auth=$REPLY
    if (( ++count <= stop )); then
      authors=$authors${authors:+", "}$auth
    fi
  done < <(xmllint --xpath "//Creator[starts-with(@role, 'Author')]/text()" "$1")

  echo "$authors"
}

extract_title() {
  # Usage: extract_title book.odm.metadata
  Title=$(xmllint --xpath '//Title/text()' "$1")
  # :word: expands to letters, digits and underscore, :blank: is space
  echo ${Title//[^[:word:][:blank:].]/-}
}

echo "Metadata file: " "$1"
echo "Title: " "$(extract_title "$1")"
echo "Author list: " "$(extract_author "$1")"
echo ""
echo "All authors: " "$(extract_all_author "$1")"
echo ""
echo ""
gtTracy commented 4 years ago

Edit: Added info related to differences in xmllint command.

BTW, I went back and tried your example xmllint command - apparently mine does work differently than yours, because:

tracy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ xmllint --xpath '//a/text()' <(echo '<root><a>1</a><a>2</a><b>3</b></root>')
1
2
tracy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ 

So I''m not sure what might be going on there... When I try it with "live" data, I get:

tracy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ xmllint --xpath "//Creator[starts-with(@role, 'Author')]/text()" TorcomCollection.odm.metadata
Kai Ashante Wilson
Paul Cornell
Alter S. Reiss
Nnedi Okorafor
K. J. Parker
Angela Slatter
Matt Wallace
Daniel Polansky
Sylvia Spruck Wrigley
Michael R. Underwood
tracy@tracy-HP:/media/tracy/Refurb/LibAB/odm$ 

So I compared the two "Compiled with" lines between yours and mine, and yours has one thing that mine doesn't - Expr - and mine has two that yours doesn't - Iconv Lzma. I don't know if either of those differences would make a difference in this case, but I'll do a bit of research to see if I can find out anything. Meanwhile, I'll see if I can find a way around the problem. Or, alternatively, I'll look at your new script to see if anything jumps out at me as to why it might have returned only one author for a multi-author book.

chbrown commented 4 years ago

Right, exactly — e6744a5 intentionally treats the first author as the only author 😢 🤷‍♂️ . That's come up before (#16), it came up here (despite taking a while to realize that's what it was), so I bet it'll come up again. In my library, just about 90% of the books have only one author, so it's not a huge loss. Worth the tradeoff to increase resiliency / reliability / simplicity, IMO.

Thanks for the xmllint comparison. That confirms what I was starting to figure was the cause of all this — and yes, as you predicted, it's indisputably "something silly" 😖

One last thing: what about the extra downloads with partial names? E.g., in your original comment you had a couple spurious curl calls like:

Downloading Joe Haldeman,  - Buying Time-/Buying Time--
# [...]
Output already exists: Joe Haldeman,  - Buying Time-/Buying Time--

(and I'm guessing that the resulting Buying Time-- file was empty, or perhaps some HTML from a 404 page)

gtTracy commented 4 years ago

Honestly, I don't see using only the first author as a problem, at least in terms of storing the downloaded files. Once I start processing the files (later on in my process), I would want to know about additional authors, but that's a different stage of things.

I haven't had a chance to look into the "output already exists" error (even though that was my "original problem")... It wasn't as high on the list because it didn't actually affect what I was getting in terms of files and such. It's still on my list to check into.

BTW, in doing a bit of digging, apparently the \n delimiting of returned data from xpath was "fixed" in 2.9.10, so updating xmllint to 2.9.10 (or later, presumably) would make the difference for that part of things (which explains why I didn't see the problem that necessitated _xmllint_iter_xpath).

gtTracy commented 4 years ago

I just went back and checked the "spurious newlines" branch code, and it is still producing the "output already exists" messages. It is worth noting that the files themselves do download correctly - first glance tells me the script is trying to create the folder to download into each time it starts to retrieve a file. So I will look into that next. (I had been using the "info" option rather than the "dowload" option for most of my testing, since the other problems all seemed to revolve around the extract_author and extract_title functions). I'll let you know what (if anything I find, once I have had a chance to look :-)

chbrown commented 4 years ago

Ooh, good to know that the behavior of xmllint --xpath is a known issue that got "fixed."

I just pushed 2.1.1 which should solve all your problems 😜 ... well, at least those mentioned in the title of this issue 😃  (if not, lmk)

(In hindsight, I'm pretty sure 7157af2 alone would handle those problems, but I think the other couple of commits that got us here are still improvements 🤷‍♂️ )

gtTracy commented 4 years ago

So... Made a tiny change to the "spurious newlines" branch code I downloaded, and immediately found the "output already exists" problem. Which is exactly what we thought it was - spurious newlines. Now, I haven't dug deeply enough into the code to give you an exact fix, but here's what's happening.

In extract_filenames, you are calling _xmllint_iter_xpath to return individual paths, and you are stringing them together putting a \n between them (to work around the problem in xmllint versions previous to 2.9.9 where xmllint --xpath doesn't delimit returns with \n). However, with xmllint 2.9.10 (which is what I have), adding that \n to the returned data causes a doubled \n. When this comes back to the loop in download, the read -r path is getting the first part name, then an "empty" part name (between the doubled \n) - it's the "empty" part that is causing the "output already exists", because $path is null, which causes $suffix to be null and $output to be identical with $dir.

So, the simplest possible fix (currently untested, but logically should work) is on the line immediately below read -r path, check $path for null/empty. If it is, loop for the next value.

Or, do what you already did - you posted your commit while I was typing here O:-) Newest commit script works without any problems :-)