codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.09k stars 2.11k forks

The content extracted by newspaper is out of order #947

Open riusksk opened 2 years ago

riusksk commented 2 years ago

When using newspaper to extract articles that contain code, the content comes out in the wrong order. For example, on http://akat1.pl/?id=2 the original reads:

The error is placed in the pass-through() function of mail.local:
<code>

After extraction, it becomes:

<code>
The error is placed in the pass() function of mail.local: 

This bug is in the convert_to_text() function of outputformatters.py:

    def convert_to_text(self):
        txts = []
        for node in list(self.get_top_node()):  # Bug!!!!
            try:
                txt = self.parser.getText(node)

If you use the following code to produce txt instead, the order is correct (it just doesn't wrap lines correctly), but with the for loop above the output is out of order:

    txt = self.parser.getText(self.get_top_node())
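
To illustrate how the two approaches can disagree, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for newspaper's parser, which is an assumption, and the sample markup and the helper names `text_per_child` and `text_whole_node` are hypothetical. It does not reproduce the exact reordering on the page above. It only shows that looping over the direct children visits child elements alone, while a single walk over the top node also picks up text attached directly to it, in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the top node; not the real markup of the page.
SAMPLE = (
    "<div>"
    "intro text"
    "<pre>code block</pre>"
    "text after code"
    "<p>closing paragraph</p>"
    "</div>"
)


def text_per_child(top):
    """Mimic the for loop: collect text from each direct child separately."""
    return ["".join(child.itertext()) for child in list(top)]


def text_whole_node(top):
    """Mimic the single call: walk the whole subtree in document order."""
    return list(top.itertext())


top = ET.fromstring(SAMPLE)
per_child = text_per_child(top)
whole = text_whole_node(top)

print(per_child)  # ['code block', 'closing paragraph']
print(whole)      # ['intro text', 'code block', 'text after code', 'closing paragraph']
```

In this sketch the text that sits directly inside the top node ("intro text" and "text after code") is never seen by the per-child loop, so any step that handles it separately could plausibly put it back in a different position. Whether that is what happens inside newspaper would need to be confirmed against the library's cleaning code.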