Mathelab / ALTRE

ALTered Regulatory Elements
http://mathelab.github.io/ALTRE/

error reading peak files #49

Closed catch-minie closed 6 years ago

catch-minie commented 7 years ago

When I use the loadBedFiles() function to read peak files, I get this error:

"Error in loadBedFiles("csv_data") : csvfile must be a data.frame"

Can anyone tell me what the problem could be?

rfarouni commented 7 years ago

Did the previous step produce an error? What do you get when you run the following

csvfile <- loadCSVFile("mydata.csv")
csvfile

where "mydata.csv" is the metadata file that you need to create first. The vignette has more information if you would like clarification on this step.

catch-minie commented 7 years ago

Hi, thanks for your reply. I am following the vignette. The previous step gave me proper results: it could read the csv file and display the columns.

I have no idea why this is happening.

catch-minie commented 7 years ago

This command is throwing the error mentioned above:

samplePeaks <- loadBedFiles(csvfile)
samplePeaks

rfarouni commented 7 years ago

can you post the output you get when you run this line?

str(csvfile)

catch-minie commented 7 years ago

Output of csvfile:

  bamfiles  peakfiles sample replicate
1   S3.bam S3.bed.zip     S3         1
2   S7.bam S7.bed.zip     S7         2
3   S1.bam S1_bed.zip     S1         1
4   S5.bam S5.bed.zip     S5         2
5   S4.bam S4.bed.zip     S4         1
6   S8.bam S8.bed.zip     S8         2
7   S2.bam S2.bed.zip     S2         1
8   S6.bam S6.bed.zip     S6         2

Output of next step:

samplePeaks = loadBedFiles(csvdata)
Error in names(bedFiles) <- paste(csvfile$sample, csvfile$replicate, sep = "") : 'names' attribute [8] must be the same length as the vector [0]

samplePeaks
Error: object 'samplePeaks' not found

rfarouni commented 7 years ago

The file was not formatted correctly. Make sure there are commas separating the fields, similar to what you see here

catch-minie commented 7 years ago

@Mathelab

I installed it today. I am trying it directly on my own data; I haven't tried it on the sample data.

I understand your point that a description in words may not be enough, so I have posted the output of the steps in my comment above. I hope it helps you understand the problem.

catch-minie commented 7 years ago

@rfarouni it is saved as a csv. I am pasting the output from the command line, so it's unformatted.

Output of str(csv_data):

'data.frame': 8 obs. of 4 variables:
 $ bamfiles : Factor w/ 8 levels "S1.bam",..: 3 7 1 5 4 8 2 6
 $ peakfiles: Factor w/ 8 levels "S7.bed.zip",..: 2 1 4 3 5 6 7 8
 $ sample   : Factor w/ 8 levels "S3","S7",..: 1 2 3 4 5 6 7 8
 $ replicate: int 1 2 1 2 1 2 1 2

rfarouni commented 7 years ago

@kminie We apologize for any inconvenience this might have caused you. A CSV file is just a text file. What determines the format is not the ".csv" suffix but the commas that separate the fields. A quick solution would be to use the template I linked you to and replace the file names in it with the names of the files you have. I hope this helps solve the problem.
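To make the point about commas concrete, here is a minimal base R sketch that writes a small sample sheet and then looks at the raw text. The file contents are hypothetical; only the four column names come from the output posted in this thread. It does not call any ALTRE function.

```r
# Hypothetical sample sheet; column names follow the output posted in this thread.
sheet <- data.frame(
  bamfiles  = c("S1.bam", "S2.bam"),
  peakfiles = c("S1.bed.gz", "S2.bed.gz"),
  sample    = c("S1", "S2"),
  replicate = c(1, 2)
)

# Write it as plain comma-separated text.
csv_path <- tempfile(fileext = ".csv")
write.csv(sheet, csv_path, row.names = FALSE, quote = FALSE)

# Inspect the raw lines: a well-formed CSV has the same number of commas on every line.
raw_lines <- readLines(csv_path)
print(raw_lines)
commas_per_line <- lengths(regmatches(raw_lines, gregexpr(",", raw_lines, fixed = TRUE)))
print(commas_per_line)
```

If readLines() on your own file shows tabs, spaces, or unreadable binary characters instead of commas, the file is probably not a plain CSV, whatever its suffix says.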

catch-minie commented 7 years ago

Hi, I have used the file you linked, but I still get the same problem. I suspect the problem is with the loadBedFiles() function. Any idea how to check that, or is there an alternative to it?

catch-minie commented 7 years ago

Hi!

Just to update you: I tried the sample sheet you linked, but it does not seem to work with that either. Here is the output:

csv_data = read.csv('DNaseEncodeExample.csv',header=T,sep=',')
csv_data
                    bamfiles               peakfiles sample replicate
1 A549_ENCFF001CLE_chr21.bam A549_ENCFF001WCZ.bed.gz   A549         I
2 A549_ENCFF001CLJ_chr21.bam A549_ENCFF001WDA.bed.gz   A549        II
3 SAEC_ENCFF001EFI_chr21.bam SAEC_ENCFF001WSH.bed.gz   SAEC         I
4 SAEC_ENCFF001EFN_chr21.bam SAEC_ENCFF001WSI.bed.gz   SAEC        II
samplePeaks = loadBedFiles(csvdata)
Error in names(bedFiles) <- paste(csvfile$sample, csvfile$replicate, sep = "") : 'names' attribute [4] must be the same length as the vector [0]
samplePeaks
Error: object 'samplePeaks' not found
str(csv_data)
'data.frame': 4 obs. of 4 variables:
 $ bamfiles : Factor w/ 4 levels "A549_ENCFF001CLE_chr21.bam",..: 1 2 3 4
 $ peakfiles: Factor w/ 4 levels "A549_ENCFF001WCZ.bed.gz",..: 1 2 3 4
 $ sample   : Factor w/ 2 levels "A549","SAEC": 1 1 2 2
 $ replicate: Factor w/ 2 levels "I","II": 1 2 1 2

Pardon me for unformatted text.

Mathelab commented 7 years ago

Thanks for providing the full output! I believe the problem is that you are reading the csv file into an object yourself, using R's built-in "read.csv" function. We need you to read in the csv file using our function "loadCSVFile". The two look similar, but ours does behind-the-scenes work that allows downstream functions to work properly. Let us know if you have more questions or if it doesn't work.

Liz (baskineliz)
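For readers wondering what the "'names' attribute [4] must be the same length as the vector [0]" message means: it is a generic R error raised when names are assigned to an object of a different length. The sketch below reproduces it in base R with an empty list standing in for bedFiles. It only illustrates the error itself; why the list ended up empty inside loadBedFiles() is not shown here and is not confirmed by this thread.

```r
# An empty list stands in for a 'bedFiles' object into which no peak files were read.
bed_files <- list()

# Four labels built the same way as in the error message (sample and replicate pasted together).
labels <- paste(c("A549", "A549", "SAEC", "SAEC"), c("I", "II", "I", "II"), sep = "_")

# Assigning four names to a zero-length list fails; capture the message instead of stopping.
result <- tryCatch(
  {
    names(bed_files) <- labels
    "ok"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```

In other words, the sample sheet had four rows, but zero peak files had been loaded by the time the names were assigned.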

catch-minie commented 7 years ago

Hi Liz !

Thanks for your reply. The loadCSVFile function worked with the test data, but with my own data set it throws the following error:

csv <- loadCSVFile('sample_sheet.csv')
Error in loadCSVFile("sample_sheet.csv") : If there are greater than two sample types in the file, two samples must be selected for analysis using the sample1 and sample2 parameters. This software can only compare two samples at a time.

What does "only two samples" refer to here?

osubmi784323 commented 7 years ago

1 S3.bam S3.bed.zip S3 1
2 S7.bam S7.bed.zip S7 2
3 S1.bam S1_bed.zip S1 1
4 S5.bam S5.bed.zip S5 2
5 S4.bam S4.bed.zip S4 1
6 S8.bam S8.bed.zip S8 2
7 S2.bam S2.bed.zip S2 1
8 S6.bam S6.bed.zip S6 2

It looks like you have 8 different samples listed in your csv file (S1 to S8). We only support two samples at a time. If you look at the example csv file, there are four rows, but the sample column contains only two kinds of values: A549 or SAEC. Are S3, S1, S4, and S2 all one sample with four replicates? And the same for S7, S5, S8, and S6? If so, I would reformat your csv like this:

1 S3.bam S3.bed.zip S1 1
2 S7.bam S7.bed.zip S2 2
3 S1.bam S1_bed.zip S1 1
4 S5.bam S5.bed.zip S2 2
5 S4.bam S4.bed.zip S1 1
6 S8.bam S8.bed.zip S2 2
7 S2.bam S2.bed.zip S1 1
8 S6.bam S6.bed.zip S2 2

Basically, the sample column is a way to group samples together. It is not the literal sample name you are using to identify each individual sample you are analyzing.

Liz
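The idea that the sample column is a grouping label can be checked before calling loadCSVFile. Below is a minimal base R sketch that counts the distinct sample types and then relabels them into two groups. The id column, the group names groupA and groupB, and the group membership are all hypothetical; which of your files actually belong together is something only you can decide.

```r
# Hypothetical sheet in which every row carries its own sample label (8 distinct values).
sheet <- data.frame(
  id     = c("S3", "S7", "S1", "S5", "S4", "S8", "S2", "S6"),
  sample = c("S3", "S7", "S1", "S5", "S4", "S8", "S2", "S6"),
  stringsAsFactors = FALSE
)

# More than two distinct values in 'sample' is what triggers the two-sample error above.
n_types_before <- length(unique(sheet$sample))
print(n_types_before)

# Assumed grouping, for illustration only: these four ids are replicates of one condition.
group_a <- c("S3", "S1", "S4", "S2")
sheet$sample <- ifelse(sheet$id %in% group_a, "groupA", "groupB")

# After relabelling, only two sample types remain.
print(table(sheet$sample))
```

The error message quoted earlier also mentions sample1 and sample2 parameters for picking two sample types out of a larger sheet; that route is not sketched here.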

catch-minie commented 7 years ago

I got your point. I understand the current limitation of two samples at a time, so in my case I have to split the 8 samples into 4 and 4 and run them that way, as I have S1, S2, S3, S4. There is no problem with that. So I tried with two samples at a time, and I am again getting the following error:

> library(ALTRE)
> csv <- loadCSVFile('SampleSheet.csv')
> csv
# A tibble: 4 × 5
             bamfiles       peakfiles sample replicate
                <chr>           <chr>  <chr>     <chr>
1 S1_combined_new.bam  S1_peak.bed.gz     S1         I
2     S2_combined.bam S2_peaks.bed.gz     S2         I
3     S5_combined.bam  S5_peak.bed.gz     S1        II
4     S6_combined.bam  S6_peak.bed.gz     S2        II
> samplePeaks <- loadBedFiles(csv)
Parsed with column specification:
cols(
  X1 = col_character()
)
Error: subscript contains NAs or out-of-bounds indices

I am attaching the sample sheet here for your reference.

SampleSheet.xlsx

osubmi784323 commented 7 years ago

I think it's a problem with your bed file format. Are they formatted like this?: http://useast.ensembl.org/info/website/upload/bed.html We require the first three columns only, separated by tabs. The columns are: chr, start, stop.

Can I see the first few lines? Use the read.table command combined with the head command like this:

head(read.table("A549_ENCFF001WCZ.bed.gz"))
    V1     V2     V3 V4 V5 V6         V7        V8 V9
1 chr1  10240  10349  .  0  .    6.67949   4.52317 -1
2 chr1 237718 237872  .  0  .   14.14010  13.65110 -1
3 chr1 564495 564831  .  0  .   37.23690 139.09800 -1
4 chr1 565252 566084  .  0  .   33.17420 117.73700 -1
5 chr1 566585 567276  .  0  .   37.82960 157.88400 -1
6 chr1 567451 567873  .  0  . 1546.21997 324.00000 -1
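Beyond eyeballing the first lines, the three-column requirement can be checked programmatically. The following base R sketch writes a tiny tab-separated file (the coordinates are copied from the example output above) and runs a few sanity checks on it. The specific checks are my own suggestions, not ALTRE's validation logic.

```r
# Write a small tab-separated BED-like file to a temporary location.
bed_path <- tempfile(fileext = ".bed")
writeLines(
  c("chr1\t10240\t10349", "chr1\t237718\t237872", "chr1\t564495\t564831"),
  bed_path
)

# Read it back and name the first three columns as described above: chr, start, stop.
peaks <- read.table(bed_path, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
colnames(peaks)[1:3] <- c("chr", "start", "stop")

# Basic sanity checks on the first three columns.
checks <- c(
  at_least_three_columns = ncol(peaks) >= 3,
  numeric_coordinates    = is.numeric(peaks$start) && is.numeric(peaks$stop),
  start_before_stop      = all(peaks$start < peaks$stop)
)
print(checks)
```

read.table() can also read gzip-compressed files directly, which is why the head(read.table(...)) call above works on a .bed.gz file.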

catch-minie commented 7 years ago

The bed file has three columns only, as you mentioned. Here are the first few lines of the bed file:

head(read.table("S6_peak.bed.gz"))
           V1    V2    V3
1 Scaffold100 13393 13394
2 Scaffold100 33433 33434
3 Scaffold100 40551 40552
4 Scaffold100 69243 69244
5 Scaffold100 92335 92336
6 Scaffold100 96863 96864

Here is the output of the code I have run up to the samplePeaks step:

csv <- loadCSVFile('SampleSheet.csv')
samplePeaks <- loadBedFiles(csv)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_integer(),
  X3 = col_integer()
)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_integer(),
  X3 = col_integer()
)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_integer(),
  X3 = col_integer()
)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_integer(),
  X3 = col_integer()
)

samplePeaks
GRangesList object of length 4:
$S1_I
GRanges object with 81705 ranges and 2 metadata columns:
             seqnames                 ranges strand | sample replicate
      [1] Scaffold100         [13412, 13412]      * |     S1         I
      [2] Scaffold100         [40552, 40552]      * |     S1         I
      [3] Scaffold100         [69371, 69371]      * |     S1         I
      [4] Scaffold100         [92479, 92479]      * |     S1         I
      [5] Scaffold100         [96879, 96879]      * |     S1         I
      ...         ...                    ...    ... .    ...       ...
  [81701]    chr9_10S [104515510, 104515510]      * |     S1         I
  [81702]    chr9_10S [104574961, 104574961]      * |     S1         I
  [81703]    chr9_10S [104575286, 104575286]      * |     S1         I
  [81704]    chr9_10S [104590811, 104590811]      * |     S1         I
  [81705]    chr9_10S [104595820, 104595820]      * |     S1         I
...
<3 more elements>
seqinfo: 5008 sequences from an unspecified genome; no seqlengths

consensusPeaks <- getConsensusPeaks(samplepeaks = samplePeaks, minreps = 2)
Warning messages:
1: In GenomeInfoDb::keepSeqlevels(finalgranges, chrom_subset) :
  invalid seqlevels 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY' were ignored
2: In GenomeInfoDb::keepSeqlevels(finalgranges, chrom_subset) :
  invalid seqlevels 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY' were ignored
plotConsensusPeaks(samplepeaks = consensusPeaks)

The next step is where the reference has to be defined. In your sample code it is SAEC; does that mean SAEC is used as the control here, and that I have to specify the control sample for my own data in the same way?

osubmi784323 commented 7 years ago

Ok, great, thanks for the output. Does your bed file contain any lines where the chromosome is any of these: 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY'? The warning you are getting makes me think not.

Is this data human? Our program only works on human data and processes only the above chromosomes (there might be an update in the next week or so to include nonhuman samples, I'm working on it). I don't know what Scaffold100 or chr9_10S is.

The purpose of our software is to compare two samples and see how one changes in comparison to the other. In our example, SAEC is normal lung tissue and A549 is cancerous. We care about how the cancer tissue has changed in comparison to the normal tissue, so SAEC is the normal/control/reference. If you just had two cancer samples and wanted to see the difference between them, there might be no "obvious" control, and you could simply pick either of the cancer samples as the reference.
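A quick way to answer the chromosome question for your own peak file is to compare its sequence names against the 24 names listed in the warning. The base R sketch below does that. The vector of sequence names is a hypothetical stand-in for the first column of a bed file; two of its values are taken from the output posted above.

```r
# The 24 chromosome names listed in the warning message.
human_chroms <- paste0("chr", c(1:22, "X", "Y"))

# Hypothetical first column of a peak file; "chr1" is added only to show a match.
seqnames_in_bed <- c("Scaffold100", "Scaffold100", "chr9_10S", "chr1")

# How many peaks sit on a recognised chromosome, and which names are not recognised?
matched   <- seqnames_in_bed %in% human_chroms
unmatched <- setdiff(seqnames_in_bed, human_chroms)
print(table(matched))
print(unmatched)
```

If nothing matches, every peak is on a sequence the human-only version would discard, which is consistent with the warnings shown above.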

catch-minie commented 7 years ago

OK, thanks, I understand. No, it's not human data. I will be eagerly waiting for the updated version of the software that handles non-human samples. It would be great to use my data with the tool to debug it and find out any further requirements of your tool. Please keep me updated regarding the new version. Thanks

osubmi784323 commented 7 years ago

Ok, you're the second person to ask so it's a priority :). I will update you early next week.

Mathelab commented 7 years ago

Hello,

What species are you looking into? Do you have an annotation file you could send us?

catch-minie commented 7 years ago

Yes, I have it. It's a huge file; please write down your email ID so that I can zip it and share it on Google Drive.

catch-minie commented 7 years ago

XL9.1_v3.genes.gff.gz

Your email address is not working, so I have attached the file here.

osubmi784323 commented 7 years ago

Oops sorry I edited it. Thanks! Working on it.

osubmi784323 commented 7 years ago

Ok I just updated the package and it is now able to process other organisms!

Here is the information on it: https://github.com/Mathelab/ALTREsampledata/tree/master/gtfManipulation

If, after reading the above, you feel that you would like help converting your gff to a bed, I can help, but I will need some information:

Is every line in the attached gff a TSS that you want to use or do you only want to use some lines as TSS? For example only the CDS?
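For orientation, this is roughly what taking a TSS from a gff line involves. The base R sketch below uses two hypothetical gff records and takes the start coordinate as the TSS on the plus strand and the end coordinate on the minus strand, then writes BED-style 0-based, half-open intervals. This is a generic illustration; the exact TSS file layout ALTRE expects is described at the link above and may differ.

```r
# Two hypothetical gff records (only the columns needed here are kept).
gff <- data.frame(
  seqid  = c("Scaffold100", "Scaffold100"),
  type   = c("gene", "gene"),
  start  = c(1000L, 5000L),  # gff coordinates are 1-based and inclusive
  end    = c(2000L, 6000L),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)

# The TSS is the 5' end of the feature: 'start' on the plus strand, 'end' on the minus strand.
tss_pos <- ifelse(gff$strand == "-", gff$end, gff$start)

# BED intervals are 0-based and half-open, so a single base at position p is [p - 1, p).
tss_bed <- data.frame(chr = gff$seqid, start = tss_pos - 1L, stop = tss_pos)
print(tss_bed)
```

Restricting the conversion to particular feature types (for example only gene or only CDS lines) would amount to filtering on the type column before computing tss_pos.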

catch-minie commented 7 years ago

I want to use the TSS for each line. Also, is this a separate package altogether, or do I have to merge it with the old software version?

catch-minie commented 7 years ago

Hi! Again an error:

> library(ALTRE)
> csvfile <- loadCSVFile("samplesheet.csv")
Warning message:
The following named parsers don't match the column names: datapath
> csvfile
# A tibble: 4 × 4
       bamfiles       peakfiles   sample replicate
1 S1_filter.bam S1_peaks.bed.gz    CC-3h         I
2 S2_filter.bam  S2_peak.bed.gz Ptf1a-3h         I
3 S5_filter.bam S5_peaks.bed.gz    CC-3h        II
4 S6_filter.bam S6_peaks.bed.gz Ptf1a-3h        II
> samplePeaks <- loadBedFiles(csvfile)
Error in names(bedFiles) <- paste(csvfile$sample, csvfile$replicate, sep = "_") : 'names' attribute [4] must be the same length as the vector [0]
In addition: Warning message:
Unknown column 'datapath'

osubmi784323 commented 7 years ago

Can you send me the csv?

catch-minie commented 7 years ago

SampleSheet.xlsx

catch-minie commented 7 years ago

Hi, did you manage to figure out the problem? Looking forward to your reply.

osubmi784323 commented 7 years ago

The csv can't be an xlsx file -- it must be a comma separated file. Otherwise, it looks ok. If after making it into a csv it still errors then I think you need to send me all your data if you would like me to try to work with it further.
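One way to tell whether a file is really a spreadsheet regardless of its name: an xlsx file is a zip archive, and zip archives begin with the two bytes "PK". The base R sketch below checks for that signature. The helper name looks_like_zip is hypothetical, and both test files are generated on the spot so the example is self-contained.

```r
# Hypothetical helper: TRUE if the file starts with the zip signature bytes "PK" (0x50 0x4B).
looks_like_zip <- function(path) {
  magic <- readBin(path, what = "raw", n = 2L)
  length(magic) == 2L && identical(magic, as.raw(c(0x50, 0x4B)))
}

# A stand-in for a spreadsheet that was merely renamed to .csv (only the zip header bytes).
fake_xlsx <- tempfile(fileext = ".csv")
writeBin(as.raw(c(0x50, 0x4B, 0x03, 0x04)), fake_xlsx)

# A genuine comma-separated text file.
real_csv <- tempfile(fileext = ".csv")
writeLines(c("bamfiles,peakfiles,sample,replicate", "S1.bam,S1.bed.gz,S1,I"), real_csv)

print(looks_like_zip(fake_xlsx))
print(looks_like_zip(real_csv))
```

If the check returns TRUE for your sample sheet, open it in the spreadsheet program and export it as comma-separated text rather than renaming it.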

Mathelab commented 6 years ago

Hello, I am closing this issue for now. Feel free to reopen an issue should you run into any other problems. Best, Ewy