ISUgenomics / SequelTools

GNU General Public License v3.0

Issue with sample names #7

Open mldmort opened 3 years ago

mldmort commented 3 years ago

Hi,

I'm running SequelTools on 8 CLR samples. I pass the samples with the -u subFiles.txt option; subFiles.txt contains the paths of the bam files. This is my command:

SequelTools.sh -t Q -u subFiles.txt -n 12 -p a -g a -o $OUT_DIR

I am getting weird plots for my stats, with the same name shown for every bam file. A sample plot is attached. The summaryTable.txt also shows the same numbers for all samples in most columns:

SMRTcell    numReadsSubread numReadsLongestSub  totalBasesSubread   totalBasesLongestSub    meanReadLenSubread  meanReadLenLongestSub   medianReadLenSubread    medianReadLenLongestSub n50Subread  n50LongestSub   l50Subread  l50LongestSub   PSR ZOR
oasis   1320271 181528  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.137
oasis   2578421 377887  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.147
oasis   2252172 320325  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.142
oasis   2320629 335461  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.145
oasis   2266229 324966  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.143
oasis   2165289 302979  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.140
oasis   4398328 638727  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.145
oasis   2499748 348122  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.139

Would you let me know what's wrong? Thanks.

Attachment: n50s.pdf https://github.com/ISUgenomics/SequelTools/files/5412390/n50s.pdf
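To make the symptom concrete, here is a minimal sketch that lists which columns of a summary table hold the same value in every row. The function name `constant_columns` is hypothetical and not part of SequelTools; the sketch assumes a whitespace-separated table with one header line and the same number of fields in every row. The two data rows are copied from the table above.

```python
def constant_columns(table_text):
    """Return header names of columns whose value is identical in every data row."""
    lines = [ln.split() for ln in table_text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    constant = []
    for i, name in enumerate(header):
        # Collect the distinct values seen in column i across all data rows.
        values = {row[i] for row in rows}
        if len(values) == 1:
            constant.append(name)
    return constant


# Header and first two rows copied from the summaryTable.txt in this report.
SAMPLE = """\
SMRTcell numReadsSubread numReadsLongestSub totalBasesSubread totalBasesLongestSub meanReadLenSubread meanReadLenLongestSub medianReadLenSubread medianReadLenLongestSub n50Subread n50LongestSub l50Subread l50LongestSub PSR ZOR
oasis 1320271 181528 21082583975 3794848484 8434 10901 8317 9856 9304 11125 885174 122214 0.180 0.137
oasis 2578421 377887 21082583975 3794848484 8434 10901 8317 9856 9304 11125 885174 122214 0.180 0.147
"""

if __name__ == "__main__":
    for name in constant_columns(SAMPLE):
        print(name)
```

On these rows only numReadsSubread, numReadsLongestSub, and ZOR differ between samples; every other column, including the SMRTcell name, repeats. With a single data row every column would be reported, so the check is only meaningful for two or more samples.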

DavidEHufnagel commented 3 years ago

Hello,

Thank you for using SequelTools! Subfiles.txt should be a file of filenames, which it sounds like it is in your case. These filenames determine the name of each SMRTcell in the output. Are your files all named oasis.bam? If so, changing those names to unique identifiers should resolve the issue. Let me know if that works for you.

Best, Dr. David E. Hufnagel
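One quick way to test the suggestion above is to look for repeated file names in the file of filenames. This is a minimal sketch, not SequelTools code: the function name `duplicate_basenames` and the example paths are hypothetical, and it assumes one POSIX-style path per entry.

```python
from collections import Counter
from pathlib import PurePosixPath


def duplicate_basenames(paths):
    """Return the sorted file names that occur more than once in a list of paths."""
    # Ignore blank entries and compare only the final path component.
    counts = Counter(PurePosixPath(p.strip()).name for p in paths if p.strip())
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    example = [
        "/data/run1/oasis.bam",
        "/data/run2/oasis.bam",
        "/data/run3/other.bam",
    ]
    print(duplicate_basenames(example))
```

An empty result means every input file has a distinct name, in which case identical labels in the output must come from somewhere else.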

mldmort commented 3 years ago

Hi,

My subFiles.txt contains:

/projects/long_reads/HS_founders/pacbio/demux/lima.bc1001--bc1001.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1002--bc1002.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1003--bc1003.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1008--bc1008.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1009--bc1009.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1010--bc1010.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1011--bc1011.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1012--bc1012.bam

I thought the names came from the bam files, but they don't seem to. The name oasis appears in the output directory given to the -o option:

-o /oasis/scratch/comet/temp_project/RAT_DATA/HS_FOUNDERS/Pacbio_multiplex_all/QC/SequelToolsResults

I don't know why oasis is chosen as the name for all the files, or why the stats of the last file are used for every sample. I checked, and the stats in summaryTable.txt for all samples do correspond to the last file.

Any idea why this happens? Thanks,

DavidEHufnagel commented 3 years ago

Hey Arun,

I hope you can see the whole conversation here. I'm a little perplexed by this problem. Do you have some ideas as to what's causing these issues?

Let me know, Best, David

aseetharam commented 3 years ago

@mldmort At first glance, it looks like the -- in the file name is causing something unintended. Can you please try once more after renaming the bam files without the double dash?

DavidEHufnagel commented 3 years ago

Did this resolve the issue, mldmort?

mldmort commented 3 years ago

Hi David,

No. I used symbolic links pointing to my bam files to see whether that would solve the problem. My new subfiles.txt file looks like this:

ACI.bam
BN.bam
BUF.bam
F344.bam
MR.bam
MS20.bam
WKY.bam
WN.bam

And the files link to the original bam files like:

ACI.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1001--bc1001.bam
BN.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1008--bc1008.bam
BUF.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1003--bc1003.bam
F344.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1010--bc1010.bam
MR.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1002--bc1002.bam
MS20.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1009--bc1009.bam
WKY.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1011--bc1011.bam
WN.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1012--bc1012.bam

I don't know whether linking is sufficient; maybe the next step is to change the original file names? However, the name oasis that appears in the plots most probably comes from the -o option:

-o /oasis/scratch/comet/temp_project/RAT_DATA/HS_FOUNDERS/Pacbio_multiplex_all/QC/SequelToolsResults

That's the only place the name oasis appears. Also the summaryTable.txt is still flawed with the same numbers for each row:

SMRTcell    numReadsSubread numReadsLongestSub  totalBasesSubread   totalBasesLongestSub    meanReadLenSubread  meanReadLenLongestSub   medianReadLenSubread    medianReadLenLongestSub n50Subread  n50LongestSub   l50Subread  l50LongestSub   PSR ZOR
oasis   1320271 181528  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.137
oasis   2320629 335461  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.145
oasis   2252172 320325  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.142
oasis   2165289 302979  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.140
oasis   2578421 377887  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.147
oasis   2266229 324966  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.143
oasis   4398328 638727  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.145
oasis   2499748 348122  21082583975 3794848484  8434    10901   8317    9856    9304    11125   885174  122214  0.180   0.139

Any suggestions? Thanks,

DavidEHufnagel commented 3 years ago

Yes, I believe you will have to change the original names. I am doing additional testing for a demonstration of SequelTools next week, and unfortunately I'm finding that the required format for the names of the input files is quite rigid. Names have to look like "ID.scraps.bam" or "ID.subreads.bam", where ID is usually something like "m54138_180610_050652". That has been the structure of all the files I've seen come directly from PacBio sequencing machines. This software was published just this month, and we are now getting lots of feedback on issues we had not come across before. You can expect updates in the next few weeks to make SequelTools more flexible and to resolve identified bugs and issues.

Best, David
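The naming rule described above can be checked before a run. The sketch below is an assumption-laden illustration, not SequelTools code: the regular expression encodes only the "ID.subreads.bam" / "ID.scraps.bam" shape from this comment and further assumes the ID contains no dots; `smrtcell_id` and `rename_plan` are hypothetical helpers. It only proposes new names and touches no files, and this thread does not confirm that files renamed this way are accepted.

```python
import re
from pathlib import PurePosixPath

# Assumed shape: "<ID>.subreads.bam" or "<ID>.scraps.bam", with no dots in the ID.
NAME_PATTERN = re.compile(r"^(?P<id>[^./]+)\.(?P<kind>subreads|scraps)\.bam$")


def smrtcell_id(path):
    """Return the ID part of a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.match(PurePosixPath(path).name)
    return match.group("id") if match else None


def rename_plan(paths, kind="subreads"):
    """Propose (old, new) path pairs for files whose names do not conform."""
    plan = []
    for p in paths:
        pure = PurePosixPath(p)
        if smrtcell_id(p) is not None:
            continue  # already in the assumed format
        name = pure.name
        stem = name[: -len(".bam")] if name.endswith(".bam") else name
        # Collapse dots, dashes, and other punctuation into single underscores.
        new_id = re.sub(r"[^A-Za-z0-9_]+", "_", stem)
        plan.append((str(pure), str(pure.with_name(f"{new_id}.{kind}.bam"))))
    return plan


if __name__ == "__main__":
    inputs = [
        "/projects/long_reads/HS_founders/pacbio/demux/lima.bc1001--bc1001.bam",
        "ACI.bam",
        "m54138_180610_050652.subreads.bam",
    ]
    for old, new in rename_plan(inputs):
        print(old, "->", new)
```

Treating the demultiplexed files as "subreads" is itself an assumption; pass kind="scraps" for scraps files. Applying the plan (by renaming or by creating links with the new names) is left to the user.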
