Converting RawContent to a byte stream and writing to disk corrupts the data

GoogleCodeExporter commented 9 years ago

User wrote....

I'm creating a spider that needs to be able to save the binary data it 
crawls. args.CrawledPage.RawContent seems to contain the information, but 
when I convert it to a byte stream and write it to disk, the data is corrupted. 
Any ideas on how to fix this?

Sample Code:
                    if (page.Contains(".pdf"))
                    {
                        string fName1 = string.Format(@"{0}Spider\s{1}_{2}.pdf", System.IO.Path.GetTempPath(), id, DateTime.Now.Minute * 100 + DateTime.Now.Second);
                        using (System.IO.FileStream stream = new System.IO.FileStream(fName1, System.IO.FileMode.Create))
                        {
                            System.IO.StreamWriter writer = new System.IO.StreamWriter(stream);
                            writer.Write(args.CrawledPage.RawContent);
                            writer.Flush();
                            stream.Flush();
                        }
                        try
                        {
                            // PDF processing code here (opens the file from disk
                            // and converts it to a text string)
                        }
                        catch (Exception ex)
                        {
                            // Error handling code omitted
                        }
                    }
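
A minimal sketch of why this write path can corrupt a PDF, offered as an illustration rather than as Abot's actual internals. It assumes the response body was decoded as UTF-8 text before being stored as a string; `RoundTripDemo` and `WriteAsText` are hypothetical names. Bytes that are not valid UTF-8 are replaced during decoding, so re-encoding the string through a StreamWriter cannot reproduce the original bytes.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class RoundTripDemo
{
    // Mimics the write path in the sample: the string is re-encoded as text
    // (StreamWriter defaults to UTF-8) before it reaches the stream.
    public static byte[] WriteAsText(string content)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream))
            {
                writer.Write(content);
            }
            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }

    static void Main()
    {
        // "%PDF-1.4" followed by the kind of binary marker line many PDFs start with.
        byte[] original = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x34, 0x0A,
                            0x25, 0xE2, 0xE3, 0xCF, 0xD3, 0x0A };

        // Assumption: the body was decoded as UTF-8 to build the string.
        string asText = Encoding.UTF8.GetString(original);
        byte[] written = WriteAsText(asText);

        Console.WriteLine("original: {0} bytes, written: {1} bytes",
                          original.Length, written.Length);
        Console.WriteLine("identical: {0}", original.SequenceEqual(written));
    }
}
```

Under that assumption the written array differs from the original and is longer, because each invalid byte becomes a multi-byte replacement character.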

Original issue reported on code.google.com by sjdir...@gmail.com on 18 Jul 2013 at 4:23

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 18 Jul 2013 at 4:27

Added labels: Milestone-Release1.2

GoogleCodeExporter commented 9 years ago

Do you have a test site for this?

Original comment by ilushk...@gmail.com on 19 Jul 2013 at 2:29

GoogleCodeExporter commented 9 years ago

I do not. I asked the person who reported this issue for some test URLs
and heard nothing back. I was going to just try it on any zip file I find
on the net and see whether it is a universal problem or specific to his
files. I suspect that we may need to add a CrawledPage.RawBytes property
that is filled by the IPageRequester the same way
CrawledPage.RawContent is filled. That is only if we determine that
CrawledPage.RawContent doesn't play well with streams for whatever reason.

Original comment by sjdir...@gmail.com on 19 Jul 2013 at 7:13

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 3 Sep 2013 at 1:50

Removed labels: Milestone-Release1.2

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 3 Sep 2013 at 2:52

Added labels: Milestone-Release1.2.3

GoogleCodeExporter commented 9 years ago

Added auto encoding and CrawledPage.Content.Bytes, which should allow data to be 
written to a file stream without corruption.

Original comment by sjdir...@gmail.com on 17 Sep 2013 at 2:43

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Using v1.2.3.

Like the previous poster, I am attempting to use Abot to pull a series of PDF 
files and write them to disk.

When I run Fiddler, it shows that the response for the Abot crawl and the 
response for downloading the file in the browser have the same content length. 
The same number also appears when I check 
CrawledPage.HttpResponse.ContentLength.

However, when I check the length of CrawledPage.Content.Bytes, I get a number 
about 80% higher (230,818 vs 410,906 for my test item - 
http://www.sbcounty.gov/parcelmaps/0130I1.pdf).

Comparing the direct download with the file written by the Abot-enabled 
application, small portions of the file look identical, but most 
sections differ in both length and data composition.

I've tried running the CrawledPage.Content.Bytes array through the other 
encodings available through System.Text.Encoding, but that hasn't brought me 
any luck.

Any ideas?

Thank you for your assistance in this matter.

Original comment by joelw...@gmail.com on 4 Sep 2014 at 3:41

GoogleCodeExporter commented 9 years ago

Thank you for bringing this back up. I fixed the issue with my last check-in, and 
the patched version on NuGet is 1.2.3.1031. You should now be able to save the 
raw bytes to disk using something like...

File.WriteAllBytes("whatever.pdf", crawledPage.Content.Bytes);
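
A slightly fuller sketch of the same idea, in case it helps. `BinarySaver` and its `Save` method are hypothetical helpers, not part of Abot; they derive a file name from the page URL and write the bytes unchanged. In a crawl handler you would presumably pass the crawled page's URL and `crawledPage.Content.Bytes`.

```csharp
using System;
using System.IO;

static class BinarySaver
{
    // Hypothetical helper: names the file after the last URL segment
    // and writes the bytes to disk without any text encoding step.
    public static string Save(Uri uri, byte[] bytes, string directory)
    {
        string name = Path.GetFileName(uri.LocalPath);
        if (string.IsNullOrEmpty(name))
        {
            name = "download.bin"; // fallback for URLs that end in '/'
        }

        Directory.CreateDirectory(directory); // no-op if it already exists
        string path = Path.Combine(directory, name);
        File.WriteAllBytes(path, bytes);
        return path;
    }
}

static class Program
{
    static void Main()
    {
        // Stand-in for crawledPage.Content.Bytes.
        byte[] sample = { 0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3 };
        var uri = new Uri("http://www.sbcounty.gov/parcelmaps/0130I1.pdf");
        string outputDir = Path.Combine(Path.GetTempPath(), "Spider");

        string path = BinarySaver.Save(uri, sample, outputDir);

        // The file on disk should be exactly as long as the byte array.
        Console.WriteLine(File.ReadAllBytes(path).Length == sample.Length);
    }
}
```

Comparing the written file's length against CrawledPage.HttpResponse.ContentLength is a quick sanity check that nothing was re-encoded along the way.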

Please let me know if you are still having issues.

Original comment by sjdir...@gmail.com on 4 Sep 2014 at 6:12

GoogleCodeExporter commented 9 years ago

Patch works great. Thanks for the quick response!

Original comment by joelw...@gmail.com on 4 Sep 2014 at 3:55

PiRSquared17 / abot

Converting RawContent to a byte stream and writing to disk corrupts the data #113