KurtBestor / Hitomi-Downloader

:cake: Desktop utility to download images/videos/music/text from various websites, and more.

Extraction fails because markup is stored as `&lt;` #5597

Closed kkomaya closed 1 year ago

kkomaya commented 1 year ago

When I try to extract information, the markup is stored as `\<`, so findAll cannot find anything. (Pasting the task info here makes it look as if it were saved correctly, so I am attaching a screenshot instead.) Please take a look. (screenshot attached)


Restarting in 2 minutes 40 seconds: https://18av.moe/asian-leaks/7565/

version: 3.8 (22-12-16 05:11:43 UTC)
platform / locale: Windows-10-10.0.19041-SP0 / ko_kr
order / group / uid: 0 / False / 2ab50c4eea2744bd9a2372a8ebd297f9
input: https://18av.moe/asian-leaks/7565/
type: 18av
single: True
url: https://18av.moe/asian-leaks/7565/
dir:
zip:
artist: None
valid / done: False / True
range / range_p: None / None
time: 1672700434.2223842 (23-01-02 23:00:34 UTC) - 150s elapsed
tags: []
lock: False
color: invalid
paused: False
format: None
p2f: None
segment: None
admin: True
goodbyedpi: True
ytdl: yt_dlp 2023.01.02
pinned: False
extras: {}
live: False
changed: True

[Gallery] None

[File Names]

[URLs]

[Messages]
'NoneType' object is not subscriptable
stop
Traceback (most recent call last):
  File "utils", line 1278, in start
  File "utils", line 1353, in start_
  File "dynamic_module_0", line 20, in read
TypeError: 'NoneType' object is not subscriptable

Invalid: fail=True EOT: https://18av.moe/asian-leaks/7565/ (9.5s)

'NoneType' object is not subscriptable
stop
Traceback (most recent call last):
  File "utils", line 1278, in start
  File "utils", line 1353, in start_
  File "dynamic_module_1", line 20, in read
TypeError: 'NoneType' object is not subscriptable

Invalid: fail=True EOT: https://18av.moe/asian-leaks/7565/ (9.3s)

00:00
<video controls,="" allow-same-origin,="" widht="100%" ,="" data-poster="https://18av.moe/wp-content/uploads/2022/08/57120-1-640x360.jpg" id="player"> \<source src="//cdn1.thepervs.com/videos/57120.mp4" type="video/mp4" ,="" size="576"\>
No data found
empty urls
stop
Traceback (most recent call last):
  File "utils", line 1278, in start
  File "utils", line 1368, in start_
Exception: empty urls
Invalid: fail=True
EOT: https://18av.moe/asian-leaks/7565/ (9.4s)
KurtBestor commented 1 year ago

What does your script look like?

kkomaya commented 1 year ago

I modified the existing Sogirl script.

#coding: utf8
#title: add 18av.moe site (modified)
#author: Kurt Bestor (modified later)
#description: modified 2022-01-30
import clf2
from utils import *
import json


@Downloader.register
class Downloader_sogirl(Downloader):
    type = '18av'
    URLS = ['18av.moe']
    single = True

    def read(self):
        html = self.get_page()

        soup = Soup(html)
        title = soup.find('meta', {'property': 'og:title'})['content'].strip()

        playlist = soup.find('div', class_='container')
        self.print_(playlist)  # <- printing this shows output as if '<' is not recognized properly

        srcs = []  # <- pulled out of the try block to observe the behavior
        for a in playlist.findAll('source'):
            self.print_(a)

        try:
            srcs = []
            for a in playlist.findAll('src'):
                data = a
                #.get('src')
                if not data:
                    continue
                data = 'https:' + data  # <- added in case the missing 'https:' was the problem
                srcs.append(data)

            url_video = srcs[0]
            print_(srcs[0])
            self.urls.append(url_video)
            self.filenames[url_video] = '{}{}'.format(clean_title(title), get_ext(url_video))
            self.referer = self.url

            self.title = title
        except:
            self.print_('No data found')

    @try_n(16)
    def get_page(self):
        html = clf2.solve(self.url)

        if '502: Bad gateway' in html['html']:
            self.print_('Cloudflare Bad Gateway error occurred')
            raise Exception('bad gateway')

        return html['html']


log('script loaded')

KurtBestor commented 1 year ago
Change

Soup(html)

to:

Soup(html, 'lxml')

Also, for reference, it should be playlist.findAll('source'), not playlist.findAll('src').
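The findAll point can be demonstrated in isolation. This is a minimal sketch using BeautifulSoup with the built-in html.parser and a made-up snippet of markup (the CDN URL is a placeholder, not the real page): findAll matches *tag names*, so searching for 'src' (which is an attribute, not a tag) returns nothing, while searching for the 'source' tag works.

```python
from bs4 import BeautifulSoup as Soup

# Hypothetical minimal markup mimicking the page's player block.
html = ('<div class="container"><video id="player">'
        '<source src="//cdn.example.com/videos/57120.mp4" type="video/mp4">'
        '</video></div>')

soup = Soup(html, 'html.parser')
playlist = soup.find('div', class_='container')

# findAll matches tag names; 'src' is an attribute, so nothing is found.
print(playlist.findAll('src'))   # []

# 'source' is the tag name; read the 'src' attribute from each match
# and prepend the scheme to the protocol-relative URL.
srcs = ['https:' + s['src'] for s in playlist.findAll('source')]
print(srcs)                      # ['https://cdn.example.com/videos/57120.mp4']
```

The parser argument is a separate issue: lxml is more forgiving of broken markup (such as the stray commas in the page's attribute list) than the default parser, which is why `Soup(html, 'lxml')` recovers the `<source>` tag here.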

kkomaya commented 1 year ago

After making the changes you suggested, it works correctly. Thank you! ^^