DoctorD1501 / JAVMovieScraper

Scrape XBMC and Kodi movie metadata and automatically rename files for Japanese Adult Videos (JAV), American Adult DVDs, and American Adult Webcontent
GNU General Public License v2.0
748 stars 162 forks

Issue & suggestion on how to clarify and improve DMM scraping #260

Open MattePainting opened 5 years ago

MattePainting commented 5 years ago

DMM sells DVDs, limited editions, and digital versions of JAV titles, so there are often multiple entries on their website for the same film. If you use the manual search selection for scraping, the DMM search often gives you multiple seemingly identical choices for the same title. I'm not sure which one is selected when scraping is done automatically.

The different versions do have either slightly different IDs, or completely different URLs, so it should be possible to differentiate between them. Having a hierarchy for which version is preferred, or a filter to ignore some of the entries, would make the DMM metadata way more consistent and accurate. If it's available, the regular DVD entry appears to be the best option.

Here's an example of the different IDs for the same title, SSNI-347:

DVD: http://www.dmm.co.jp/mono/dvd/-/detail/=/cid=ssni347/
Limited DVD: http://www.dmm.co.jp/mono/dvd/-/detail/=/cid=tkssni347/
Digital: http://www.dmm.co.jp/digital/videoa/-/detail/=/cid=ssni00347/
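To illustrate how the three entries above could be recognized as the same title, here is a rough Python sketch. The function names `extract_cid` and `cid_matches` are hypothetical and not part of JAVMovieScraper. The matching rule is an assumption drawn only from the SSNI-347 example: variants differ by an optional prefix (such as `tk`) and by zero-padding of the number. Real DMM cids may follow other patterns.

```python
import re
from urllib.parse import urlparse


def extract_cid(url):
    """Return the cid value from a DMM detail URL, or None if there is none."""
    match = re.search(r"/cid=([^/]+)", urlparse(url).path)
    return match.group(1) if match else None


def cid_matches(cid, movie_id):
    """Loosely check whether a cid refers to a movie ID such as 'SSNI-347'.

    Assumption (from the single example in this issue): a cid is the
    lowercase label plus the number, possibly zero-padded, possibly with
    an extra prefix in front of the label.
    """
    label, number = movie_id.lower().split("-")
    pattern = rf"^[a-z0-9_]*{re.escape(label)}0*{int(number)}$"
    return re.match(pattern, cid.lower()) is not None


urls = [
    "http://www.dmm.co.jp/mono/dvd/-/detail/=/cid=ssni347/",
    "http://www.dmm.co.jp/mono/dvd/-/detail/=/cid=tkssni347/",
    "http://www.dmm.co.jp/digital/videoa/-/detail/=/cid=ssni00347/",
]
# All three entries resolve to the same title under the assumed rule.
same_title = all(cid_matches(extract_cid(u), "SSNI-347") for u in urls)
```

A check like this only groups entries; it does not say which entry is the best one to scrape.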

This is a new release, so the metadata is pretty much the same for all, but here's an example of how this becomes more confusing for older titles in particular:

Digital: http://www.dmm.co.jp/digital/videoa/-/detail/=/cid=migd074/
Digital Moodyz: http://www.dmm.co.jp/monthly/moodyz/-/detail/=/cid=migd074/
Digital Premium: http://www.dmm.co.jp/monthly/premium/-/detail/=/cid=migd074/

This is the same video on three different parts of the site, and all three pages list different dates. Digital videos use a distribution start date instead of the release date, so the scraped date is often wrong for older titles. If any of the dates is correct, it's usually the oldest one, found on the regular digital page, but JAVMovieScraper often scrapes the page with the newest date.

This is a long post for a minor issue, but making the search and scrape more consistent and accurate will save everyone a lot of time and frustration.

Wizell commented 5 years ago

Correct me if I got this wrong, but are you suggesting that, on DMM, we scrape all the results with the same ID and merge the data?

For the release date, picking the oldest one looks like a reasonable choice, but for the plot, genres, or images it might not be the same. What do you suggest for those fields?

MattePainting commented 5 years ago

I'm not suggesting merging the fields. Instead, there should be a default page/entry JMS chooses to scrape from, and then an alternative if the default entry doesn't exist.

DVD (Default) --> Digital Video --> Skip/Ignore the rest

The DVD page is always the best choice. If it exists, all other pages should be skipped/ignored. Next best is the regular Digital Video page. Those two pages should cover any title listed on DMM.

The other pages have either a wrong ID listed or the wrong date. They can be ignored entirely.
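The fallback order above could be sketched roughly like this in Python. This is only an illustration, not how JAVMovieScraper is implemented: `SECTION_PRIORITY`, `section_rank`, and `pick_preferred` are hypothetical names, and the path prefixes are taken from the example URLs earlier in this issue.

```python
from urllib.parse import urlparse

# Preference order from the comment above: DVD first, then regular digital.
# Anything else (for example /monthly/... subscription pages) is ignored.
SECTION_PRIORITY = ["/mono/dvd/", "/digital/videoa/"]


def section_rank(url):
    """Return the priority index of a DMM URL, or None if it should be skipped."""
    path = urlparse(url).path
    for rank, prefix in enumerate(SECTION_PRIORITY):
        if path.startswith(prefix):
            return rank
    return None


def pick_preferred(urls):
    """Pick the search result from the most preferred section, or None."""
    ranked = [(section_rank(url), index, url) for index, url in enumerate(urls)]
    ranked = [entry for entry in ranked if entry[0] is not None]
    # min() compares rank first, then the original search position.
    return min(ranked)[2] if ranked else None


results = [
    "http://www.dmm.co.jp/monthly/premium/-/detail/=/cid=migd074/",
    "http://www.dmm.co.jp/monthly/moodyz/-/detail/=/cid=migd074/",
    "http://www.dmm.co.jp/digital/videoa/-/detail/=/cid=migd074/",
]
chosen = pick_preferred(results)  # the /digital/videoa/ entry
```

Note that this sketch does not tell the regular DVD apart from the limited DVD, since both live under `/mono/dvd/`. Ties within a section are broken by search order here, which is an arbitrary choice, so a real implementation would need an extra check on the ID.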

Wizell commented 5 years ago

That seems fine. I will look into the implementation details.

I'm not used to the DMM website. In the second example, are you saying that moodyz and premium should be ignored, or should the order be something like videoa > moodyz > premium? Are there movies that appear only in those sections?

I started a wiki page here to keep this information. Please correct the wiki page if I didn't understand properly.

MattePainting commented 5 years ago

Digital Premium and Digital Moodyz are subscriptions that give you access to a bunch of different movies. There aren't any movies that are only in those sections. Everything there will also be in the regular Digital section. I would just do videoa and ignore the subscription sections entirely.

Moodyz is actually just a JAV studio. There are individual subscriptions for most of the bigger studios (S1, Soft on Demand, etc).

Thanks for the wiki page! I'll take a look at it. (Edit: Doesn't look like I have permission to edit the page.)